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FEEF/^.CE 

This  report  presents  the  detailed  design  of  the  OMEGA 
system,  a  data  base  management  system  based  on  the  relational 
model  and  designed  to  run  on  a  PDF  11/45  using  the  UNIX  operating 
system.  The  OMEGA  project  was  initiated  by  D.  Tsichritzis  and  is 
based  cn  previous  experience  with  the  ZETA  project  [Erodie  et  al. 
1975].  Many  of  the  ideas  expressed  herein  resulted  from 
discussions  with  C.  Gotliet,  S.  Schuster,  D.  Tsichritzis,  M. 
Erodie,  J.  Klebanoff,  and  F.  Lochovsky.  The  system  itself  was 
designed  during  the  period  December  1974  to  July  1975  at  the 
University  of  Toronto  by  H.  A.  Schmid  and  P.  A.  Eernstein.  The 
i nplementation  of  the  system  was  begun  over  the  summer  of  1975  by 
E.  Arlow,  E.  Eaker,  and  S.  Pozgaj,  under  the  supervision  of  H.  A. 
Schmid . 


Since  the  goal  of  the  project  later  changed  towards  a 
system  supporting  hierarchical,  network,  and  relational  views  of 
data,  and,  consequently,  involved  some  internal  changes,  we 
describe  the  system  as  of  summer  1975. 
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1.1  GCALS 


1.  OVEEVIEW 


Cur  main  goal  in  implementing  a  relational  system  on  a 
small  computer  vas  to  develop  a  fairly  general  and  still 
generalizable  approach  to  organizing  a  relational  data  base 
management  system,  paying  particular  attention  to  considerations 
of  structuring  and  efficiency.  We  wanted  the  system  to  be  small 
and  clean.  We  wanted  it  to  be  describable  in  a  well-structured 
tcp-down  manner  that  leads  easily  to  a  structured  implementation. 
Our  mcdels  for  such  a  system  were  the  small  structured  operating 
systems  that  have  been  developed  using  top-dcwn  principles,  such 
as  EC4000,  UNIX,  and  Venus. 

The  system  that  we  describe  here  implements  a  subset  of 
the  concepts  reported  in  [Schmid  and  Bernstein  75].  It  is 
designed  tc  be  extensible  so  that  it  could  eventually  include  all 
of  these  concepts.  However,  we  do  not  yet  know  whether  all  of 
these  features  are  desirable  on  a  small  computer. 

Since  we  concentrated  mainly  on  structural  and 
architectural  problems,  we  will  not  treat  questions  of 
concurrency,  protection,  and  related  issues  in  this  report.  But 
we  feel  that  these  problems  can  be  solved  much  easier  in  a  data 
base  system  with  a  clean  architecture  and  structure  where 
differing  problems  can  be  solved  on  dfferent  system  levels. 

1.2  BASIC  EESIGN  ASSUMPTIONS 

Our  physical  organization  is  restricted  by  two 
assumptions. 

One  major  assumption  is  that  each  relation  is 
implemented  more  or  less  by  a  direct  access  file.  That  is,  we 
propose  that  all  tuples  of  one  system  relation  are  stored 
contiguously  as  records  of  one  file.  (We  will  weaken  this 
constraint  somewhat  in  a  later  section,  but  the  reader  should 
take  it  as  a  strict  rule  for  now.)  We  have  chosen  this  pure- file 
approach  primarily  because  it  is  quite  easy  to  implement,  since 
it  allows  existing  file  systems  to  be  used  without  modification 
as  the  lowest  software  level  of  the  data  base  system. 

An  alternative  approach  is  to  store  records  that 
correspond  to  tuples  of  different  relations  in  adjacent  locations 
of  a  single  file.  This  hvbr id-file  aproach  is  an  acceptable  and 
widely  used  file  organization,  which  makes  a  certain  class  of 
queries  (i.e.,  queries  based  mainly  on  joins)  very  efficient  to 
interpret.  However,  it  requires  rather  complex  storage 
organization  techniques  to  be  incorporated  in  the  system.  Given 
the  limitations  on  the  size  of  our  system,  we  decided  not  to 
implement  this  approach  in  addition  to  the  pure  file  approach. 

A  second  assumption  is  that  we  do  not  want  to  manipulate 
the  physical  tuples  of  a  relation  (or  the  records  of  a  file 
equivalent  to  them  on  a  lower  level)  in  the  same  way  as  the 
relational  operators  would  if  they  were  implemented  in  a 
straightforward  way  from  their  definitions.  For  example,  a 
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restriction  (say)  on  a  relation  should  not  involve  really 
removing  the  tuples  which  do  not  fulfill  the  qualification. 
Instead,  we  want  to  use  pointers  to  tuples,  because  they  require 
considerably  less  space  and  are  easier  tc  handle  than  the  tuples 
themselves.  (P.  pointer  will  usually  be  a  logical  address  (i.e., 
an  index)  of  a  record  within  a  file,  rather  than  an  address  of 
the  record  in  physcial  storage.)  We  will  use  the  term  subset 
element  to  denote  any  data  structure  that  specifies  a  subset  of 
the”tuples  of  a  relation.  For  example,  if  a  relation  is  stored  as 
a  file,  a  subset  element  could  be  an  array  of  pointers  or  a  bit 
map  (or  any  other  data  structure)  that  specifies  a  subset  of  the 
records  (i.e.,  tuples)  of  the  file  (i.e.,  relation).  The  result 
of  a  restriction  (say)  is  then  represented  by  a  subset  element 
identifying  all  the  desired  tuples. 

This  approach  in  combination  with  the  pure  file  aproach 
yields  a  significant  savings  on  disk  accesses.  We  generally  sort 
the  pointers  to  the  records  according  to  their  physical  sequence. 
(Since  we  implement  subset  elements  by  B-trees  as  we  will  see  in 
section  3,  sorting  need  not  be  done  explicitly,  but  will  be  just 
dene  as  a  side  effect.)  When  we  access  more  than  a  certain  small 
percentage  (which  depends  on  the  block  size)  of  a  file,  most  disk 
accesses  will  return  more  than  one  of  the  records  we  want  to 
retrieve.  (This  is  estimated  using  a  mathematical  model  in 
[Schmid  and  Bernstein  75]). 

In  some  special  cases,  this  approach  may  cause  seme  loss 
of  efficiency.  Still,  we  have  decided  to  handle  accesses 
uniformly  according  to  this  method.  We  feel  the  loss  of 
efficiency  in  these  special  cases  to  be  unimportant  (for  the 
intended  applications  of  our  system)  compared  to  the  simplicity 
of  a  uniform  approach. 

1.3  SYSTEM  STBUCTDEE 

Most  relational  systems  directly  interpret  (an  internal 
ferm  of)  statements  expressed  in  a  user-oriented  query  language 
(e.g.,  SEQUEL  [Chamberlin  and  Boyce  74],  QUEL  [Held  et  al.  75]). 
But  it  seems  to  be  difficult,  if  not  impossible,  to  find  a 
language  which  is  both  user-oriented  and  well  suited  for  an 
efficient  interpretation.  Therefore,  we  have  decided  not  to 
directly  interpret  the  user  language.  Bather,  we  use  a  system 
level  relational  language  that  is  high  level  enough  that 
translation  from  the  user  language  into  it  is  net  difficult  and 
that  at  the  same  time  is  oriented  towards  an  efficient 
i nplementation. 

Thus,  we  introduce  in  our  system  a  level  called  the 
system  level.  It  consists  of  a  systep  schema,  nowadays  called  a 
conceptual  schema,  and  a  system  level  language.  The  system 
schema  is  the  basis  on  which  all  user  views  are  defined.  In  our 
case,  since  we  want  to  provide  mainly  relational  user  views,  the 
system  schema  is  represented  by  relations,  the  system  relations. 
The  system  level  language  is  a  relational  algebraic  language  with 
special  features  for  handling  fast  access  paths.  It  is  described 
in  detail  in  section  2. 
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1.3.1  User  Level 

Operations  expressed  in  a  user's  relational  language  on 
a  special  user  view  have  to  be  translated  into  the  system  level. 
To  do  this,  we  need  two  mappings:  one  indicates  the  way  in  which 
the  user  language  is  translated  into  the  system  level  language; 
the  second  indicates  the  way  in  which  a  user  view  is  transformed 
into  the  system  relations. 

In  our  first  implementation,  described  in  this  report, 
we  support  a  user  language  that  is  nearly  identical  to  the  system 
level  language,  and  we  provide  only  one  user  view  that  is 
identical  to  the  system  schema.  Therefore,  our  translation  from 
the  user  language  into  the  system  level  language  is  mainly  a 
translation  from  a  user-oriented  form  into  an  internal  tree  form 
that  is  well  suited  for  optimization  purposes  and  other 
manipulation.  We  will  not  discuss  problems  of  user  views  in  the 
sequel . 

1.3.2  System  Level 

The  system  level  does  not  make  any  statements  about  the 
phsyical  organization  of  the  data  base.  This  information  is 
contained  on  the  lower  software  levels.  Consequently,  the 
physical  organization  can  be  modified,  according  to  changing 
access  characteristics  and  other  user  requirements  without  the 
system  level  being  influenced.  This  means  that  a  query  is 
expressed  in  our  system  level  language  without  paying  any 
attention  to  what  the  underlying  physical  organization  looks 
like. 


The  system  level  interface  is  supported  by  the  access 
The  access  structure  software  executes  on  a 
virtual  machine  called  the  virtual  file  and  data  structure  level. 
This  interface  supports  virtual  files,  which  are  ordered 
collections  cf  records,  and  data  structures,  which  are  used  to 
implement  fast  access  paths.  The  access  structure  software 

performs  three  main  functions  on  this  interface;  it  maps  each 
system  relation  into  exactly  one  virtual  file;  it  maintains  fast 
access  paths  to  speed  up  the  processing  of  some  queries;  and  it 
uses  the  data  manipulation  commands  on  virtual  files  and  data 
structures  to  access  the  data  base  while  interpreting  the  system 
level  language. 

There  are  two  general  kinds  of  access  paths;  cne  that 
supports  selection  (i.e.,  an  operation  within  one  relation)  and 
the  other  that  supports  linking  (i.e.,  an  operation  between  two 
relations) .  A  variety  of  access  paths  of  different  speeds  and 
maintenance  effort  have  been  described  [Schmid  and  Bernstein  75]. 
Since  our  system  is  implemented  on  a  minicomputer,  we  limited 
ourselves  to  inverted  files  (=  primary  and  secondary  indices)  as 
the  only  kind  cf  access  path  structure.  Inverted  files  can 
support  both  selection  and  link  access  paths. 
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1.3.3  Virtual  File  and  Data  Structure  Level 

The  virtual  file  and  data  structure  level  is  supported 
by  the  file  ma£  and  data  structure  software.  This  software  must 
map  each  virtual  file  and  data  structure  onto  one  or  more 
physical  files  that  are  actually  stored. 

A  single  virtual  file  can  be  stored  as  one  or  more 
physical  files.  For  example,  to  improve  disk  performance  for 
seme  queries  by  increasing  the  number  of  records  per  buffer,  a 
single  virtual  file  can  be  split  into  several  transposed  files. 
(Transposed  files  are  files  where  the  i-th  physical  record  of 
each  of  the  files  belong  together  as  a  single  logical  record.  No 
key  duplication  is  required  in  each  of  the  files.)  The  ability  to 
change  physical  structures  as  access  characteristics  change  is  an 
important  requirement  of  any  data  base  system  architecture 
[Schmid  and  Eernstein  75].  The  particular  transposed  file 
mapping  just  mentioned  is  to  be  included  in  our  system,  since  it 
is  rather  easy  to  implement.  Other  mappings  are  more  difficult 
tc  implement. 

There  are  many  different  data  structures  that  can 
support  inverted  files.  The  commands  supported  by  the  virtual 
files  and  data  structures  level  tc  access  inverted  files  can  be 
stated  without  knowing  what  data  structure  is  actually  used  to 
implement  the  inverted  file.  Knowledge  of  which  data  structure 
is  used  to  implement  a  particular  inverted  file  is  required  only 
fer  special  optimization  decisions  where  one  of  several  possible 
inverted  files  has  to  be  chosen  tc  make  the  actual  access. 

The  only  data  structure  that  we  provide  at  the  moment  to 
implement  inverted  files  are  five  different  types  cf  E-trees, 
since  E-trees  show  the  best  average  performance  over  a  wide  range 
of  conditions  (i.e.,  many  updates,  or  many  insertions,  or  neither 
cr  both,  etc.)  as  shown  in  [ Eayer  74]. 

The  concepts  of  user  view,  system  schema,  virtual  file, 
and  physical  file  lead  to  a  four  level  virtual  machine  structure 
as  shewn  in  figure  1. 

Section  2  contains  a  description  of  the  user  level  and 
system  level  languages.  A  general  description  of  the  interfaces 
and  the  software  that  implement  them  appears  in  section  3.  A 
mere  detailed  specification  of  our  programs  that  implement  these 
levels  appears  in  section  4. 
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Figure  1 

Multi* level  Data  lase  Architecture 


level  4:  Users 

level  3  interface  - — — -  relational  query  language  on 

user  relations 

level  3:  User  Level  Software 

level  2  interface  - - —  system  level  language  and 

DBA  commands  on  system  relations 

level  2:  Access  Structure  Software 

level  1  interface  — - -  operations  on  virtual  files  and 

data  structures 

level  1:  File  Map  and  Data  Structure  Software 

level  0  interface  — - - — - file  operations  on  physical 

files 
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2.  THE  SYSTEH  LEVEL  LANGDAGE  AND  OSEB  QDEEI  LANGUAGE 
2.1  THE  SYSTEM  LEVEL  LANGUAGE  —  LSL 

A  query  specification  in  the  system  level  language  is 
interpreted  by  level  2  of  the  system  (see  fig.  1).  The  language 
consists  of  essentially  two  kinds  of  operations,  which  correspond 
tc  the  restriction  and  join  of  relational  algebra.  One  is  called 
a  selector.  It  works  within  one  relation,  and  selects  a  subset  of 
the  relation  satisfying  a  given  boolean  qualification.  The  other 
operation  is  a  link.  It  makes  a  connection  between  two  relations, 
and  returns  a  collection  of  indices  identifying  the  connected 
tuples.  Both  of  these  operations  can  be  optimized  by  lower  system 
levels  in  any  way. 

Our  system  is  designed  so  that  the  processing  of 
selectors  and  links  can  be  efficiently  supported  by  auxiliary 
access  structures  that  are  stored  in  the  data  base. 

The  link  and  selector  operations  have  been  integrated 
into  a  formal  language  [Tsichritzis  75]  called  the  link  and 
selector  j.anquaqe  (abbr.  LS|.)  •  LSL  is  an  expression  oriented 
language.  Each  expression  represents  a  subset  of  the  tuples  of 
one  relation.  Therefore,  the  result  of  the  evaluation  of  an 
exression  can  be  represented  by  a  subset  element. 

A  selector  operation  applied  to  an  expression  (that  is, 
to  a  subset  of  a  relation)  yields  those  tuples  in  the  subset  that 
satisfj  a  boolean  qualification.  A  lifik  operation  makes  a 
connection  between  two  relations  (called  the  source  and  target) 
based  on  a  qualification  predicate  over  the  domains  of  both 
relations.  A  link  applied  to  an  expression  on  (i.e.,  to  a  subset 
of)  the  source  relation  yields  a  subset  of  the  target  relation 
containing  those  tuples  that  satisfy  the  associated  predicate  for 
some  tuple  in  the  source  relation  expression. 

Consider  the  relations  MANAGER  and  EMPLOYEE  (we  will  use 
this  schema  later  on  as  well) ; 

MANAGER  (MGR« , NAME, SAL , DEPT* , REGION) 

EMPLOYEE  (EMP#, NAME, AD DRESS, SAL, JOEDESC, MGR#) 

Suppose  X  is  a  subset  element  denoting  the  expression  resulting 
from  applying  a  selector  to  MANAGER  on  the  boolean  qualification 
predicate  REGICN="TOEONTO".  A  link  from  MANAGER  to  EMPLOYEE  where 
the  qualification  is  MANAGER. MGR#  =  EMPLOYEE. MGR#  delivers  for  a 
given  MANAGER-tuple  (i.e.,  a  given  MGR#)  all  those  EMPLOYEE- 
tuples  that  are  associated  with  this  manager.  The  expression 
resulting  from  applying  this  link  to  X  might  be  interpreted  as 
the  subset  of  EMPLOYEE  specifying  employees  working  for  managers 
located  in  Toronto. 

In  addition  to  links  and  selectors,  unions  and 
intersections  of  expressions  are  required  and,  of  course,  built- 
in  functions.  Projection  is  provided  merely  as  an  output 
operator.  Nevertheless,  the  LSL  has  been  shown  to  be  relationally 
complete  in  a  slightly  weakened  sense  [Tsichritzis  75]. 

In  addition  to  links  and  selectors,  the  system  level 
language  provides  facilities  for  updating  relations,  creating 
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inverted  files,  etc.  Details  of  these  commands  are  given  in 
section  3,  where  the  interface  is  defined  more  precisely.  Also  in 
section  3  is  a  description  of  the  internal  tree  form  into  which 
LSI  is  mapped. 

2.2  THE  DSEE  LANGUAGE 

The  user  language  consists  mainly  of  two  parts:  the 
query  specification  and  the  current  element  concept.  The  query 
specification  is  basically  LSL  put  into  a  SEQUEL-like  syntax 
[Chamberlin  and  Boyce  74].  The  query  specification  is  translated 
into  the  correct  format  for  the  LSL  interpreter  by  level  3  of  the 
system.  For  cur  user  language,  the  translation  is  quite 
straightforward,  since  it  is  nearly  the  same  as  LSL.  Other  user 
languages  may  require  a  more  complex  translation  process. 

A  special  class  of  operations  are  available  at  the  user 
interface  for  use  only  by  a  special  user  designated  to  be  the 
data  base  administrator  (abbr.  DBA) .  This  user  is  allowed  to 
create  and  modify  relations  and  to  create  and  destroy  fast  access 
structures  such  as  inverted  files. 

2.2.1  LSL 

An  LSL  query  begins  by  selecting  a  subset  of  the  tuples 
of  a  relation,  say  B1.  Then  El  is  linked  to  some  target  relation, 
say  E2,  based  on  a  linking  predicate.  A  subset  of  the  R2  tuples 
that  are  reached  by  the  link  can  then  be  selected,  followed  by 
the  execution  of  a  second  link,  etc.  The  syntax  for  an  LSL  query 
is : 

Sill  (HIS  <current  element  name>  |  <domain  list>) 

FECji  <relation  name> 

[WHERE  <selector  predicat e>] 

(ilNKll  II  <liiik  predicate> 

TO  <relation  namo> 

[WHERE  <select»r  predicate>]) 

[AND  . . .  T  CE“ . . . } 

Keywords  are  underlined,  "[  ]”  means  zero  or  one  instance,  "  () ” 

means  exactly  one  instance,  and  "  {}  means  zero  or  more  instances. 

After  the  keyword  GIVE,  one  can  name  the  current  element 
that  represents  the  result  of  the  query.  Omitting  this  clause 
causes  the  current  element  provided  by  the  system  to  be  used, 
possibly  destroying  an  old  reference.  When  the  result  is  to  be 
output  directly,  the  domain  list  indicates  the  desired  domains. 

An  example  of  an  LSL  query  appears  in  figure  2,  with  a 
corresponding  query  stated  in  SEQUEL.  The  LSL  link  operator  is 
much  more  powerful  than  the  link  obtained  in  SEQUEL  by  nesting 
query  blocks,  since  the  predicate  that  qualifies  a  link  in  LSL 
may  consist  of  several  simple  conditions  connected  by  logical 
operators  as  is  shown  in  fig.  2.  In  SEQUEL,  however,  links  are 
formed  only  by  one  simple  condition.  As  a  consequence,  the  LSL 
link  operator  can  replace  the  use  of  free  tuple  variables.  This 
can  simplify  queries  considerably,  as  shown  in  figure  2. 
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Figure  2 

Translation  of  a  SEQDEI  Query  into  LSL 

Mi^KAGEE  (MGE#,  N^ME,SAL,rEPT#,EEGION) 

EMELOYEE (E KP# , NAM E, ADDRESS , SAL , JOEDESC , MGE#) 

(i)  The  given  relational  schema 

Give  the  names  of  all  employees,  each  of  whom  receives  a 
higher  salary  than  his  manager. 

(ii)  English  query 


El:  Select  NAME 

from  EMPLOYEE 
where  SAI  > 

Select  SAL 

from  MANAGEE 

where  MGE#  =  E1.MGE# 

(iii)  SEQUEL  query 


give  into  EICHEMELCYEES 
FEOM  EMPLOYEE 

IiNKEE  EY  ( (EMPLOYEE. MGE#  =  MANAGEE.  MGE#) 
AND  (EMPLOYEE. SAL  >  MANAGEE. SAL) ) 
TO  MANAGEE 

(iv)  Corresponding  LSL  statement 
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Figure  3 


DEFINE  iXNK  IS_MAN;^.GEE 
FEOe  EMPLOYEE 
To’fii^.NAGER 

EY  EMPLOYEE. MGR#  =  MANAGER. MGR# 


DEFI^  ^LE^OR  «ITH_a:OP_SALARY 
on” MANAGER 
EY  SAL  >  5000 

(i)  defined  link  and  selector 


GIVE  TOPEMPLOYEES 
FROM  EMPLOYEE 

EY  I S_ MANAGED 
TO  MANAGER 

WHERE  WITH_TOP_SALARY 

(ii)  a  query  using  the  link  and  selector  defined  in  (i) 
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Link  and  selector  predicates  can  be  predefined,  which 
results  in  cataloging  them  in  a  table.  A  predefined  predicate  can 
then  be  referenced  by  name  in  a  WHERE  or  LINKED  BY  clause  in  a 
query  at  some  later  time.  The  syntax  for  defining  predicates  for 
links  is; 

DEFINE  LINK  <link  name> 

FROM  <relation  name> 

TC  <relation  name> 

BY  <link  predicate> 

and  for  selectors  is; 

define  SELECTCR  <selector  name> 

OK  <relation  name> 

WHERE  <selector  predicate> 

Similarly,  we  can  allow  the  definition  of  complex  expressions. 
The  table  of  defined  links  and  selectors  can  be  printed  using  the 
LIST  command.  Any  undesired  link  or  selector  can  be  destroyed 
using  the  DNDEFINE  command.  Sample  link  and  selector  definitions 
are  given  in  figure  3i  and  are  used  in  a  query  in  figure  3ii. 

In  addition  to  this  link  and  selector  predefinition 
facility,  we  alsc  provide  a  simple  macro  facility  for 
substitution  of  keywords.  This  allows  us  to  generate  languages 
fcr  special  applications  from  the  LSI.  Consider  for  example  the 
fcllowing  formulation  chosen  for  a  special  application  purpose. 
The  query  specification  is  the  same  as  in  fig.  3.  The  only 
mcdificaticn  is  the  redefinition  of  some  keywords. 

GIVE  ICEEMELOYEE 

WDC^IS  EMELCYEE 

wile h” IS  MANAGED 

FEQM  MANAGER  -  WITH  TOPSALAEY 

The  query  specification  part  of  LSL  has  been  developed 
further  [Tsichritzis  76].  In  particular,  it  is  shown  that  it  can 
be  used  to  provide  hierarchical-  or  network-like  interfaces  to 
relational  data  bases.  Also,  another  problem  of  LSI  (and, 
similarly  of  SEQUEL)  has  been  solved.  A  new  "keep"  command 
permits  data  from  several  relations  to  be  output  together. 

2.2.2  Current  Elements 

Ihe  "current  element"  concept  allows  easy  communication 
between  user  and  system  about  the  result  of  a  query.  We  do  not 
think  that  it  is  desirable  that  a  user  receives  at  once  the 
entire  subset  of  the  data  base  specified  by  his  query.  Moreover, 
fcllowing  such  a  strategy,  it  would  be  quite  difficult  to  embed 
the  user  language  in  programs  of  some  programming  languages 
(since  subsets  (arrays)  of  dynamic  size  must  be  dealt  with  in  the 
programming  language  and  be  passed  between  the  data  base  system 
and  the  programming  language  system) . 

Therefore,  in  our  system  the  user  names  a  "current 
element".  By  assigning  a  query  specification  to  a  current 
element,  the  current  element  gives,  after  a  query  execution,  a 
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reference  to  the  result  of  the  query  (i.e.,  to  the  value  of  the 
query  specification) .  An  example  of  this  is: 

GIVE  INTO  MY_COER_ELEMENT 
FBCM  EWELCYEE 
WHEHE  SAL  <  2000 

Since  we  consider  a  current  element  as  a  named  set  of  tuples, 
subsequent  modifications  of  tuples  in  system  relations  will  not 
affect  the  inclusion  (or  exclusion)  of  a  tuple  in  the  current 
element. 


For  simplicity  of  use,  there  is  one  current  element  that 
is  provided  implicitly  by  the  system  and  need  not  be  named.  When 
the  user  wants  to  use  several  current  elements  at  a  time,  he 
needs  to  name  them. 

Three  types  of  operations  can  be  applied  on  a  current 

element: 

i.  There  are  operations  such  as  “give  number  of  tuples  in  the 
current  element”  that  inform  the  user  about  properties  of  the 
retrieved  subset. 

ii.  The  output  operations  allow  one  to  indicate  the  domains  that 
are  to  be  output  and  to  format  the  output. 

A  pointer  is  associated  with  each  current  element.  With 
a  "print  next"  command,  one  can  output  one  tuple  at  a  time, 
namely  the  tuple  specified  by  the  pointer.  On  each  "print  next", 
the  pointer  is  automatically  incremented  to  point  to  the  next 
tuple. 


The  tuple  specified  by  the  pointer  can  also  be  updated 
or  deleted.  There  are  commands  to  set  or  reset  the  pointer. 
This  may  cause  certain  problems  when  the  pointer  is  reset  to  a 
tuple  that  has  been  updated  before. 

Also,  a  terminal  command  permits  the  user  to  output  the 
complete  subset  at  a  time,  online  or  offline. 

iii.  A  current  element  can  be  used  in  a  subsequent  query 
specification  instead  of  a  relation  name.  This  provides  mainly 
two  possibilities. 

First,  a  "give  number  of  tuples"  command  can  be  applied 
after  a  query  specification.  If  the  user  realizes  that  the 
cardinality  of  the  subset  retrieved  is  much  too  great,  then  he 
can  restrict  this  subset  by  further  conditions  before  he  outputs 
it.  When  there  are,  for  example,  too  many  employees  with  SAL  < 
2C00  (cf.  the  above  query)  then  he  may  say: 

give  into  SEC_CDEF_ELEMENT 
iicM  MY_CUEE_iLEMENT 
WHERE  NAME  =  JOHN 

Second,  complicated  queries  can  be  broken  down  into  a 
sequence  of  simple  queries,  by  storing  intermediate  results  in  a 
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current  element.  For  example,  the  explicit  formulation  of  joins 
in  the  query  specification  can  be  avoided. 

GIVE 

FROM  MANAGER 

WHERE  MGR#  =  MGR#  OF  SEC__CORR_ELEMENT 

is  equivalent  to 

GIVE  NAME 

FROM  MANAGER 

LINKER  E¥  MGR#  =  MGR# 

TO  EMFICYEE 

WHERE  SAL  <  2000  ANE  NAME  =  JOHN 

The  psychological  experiments  performed  ty  Reisner  et  al 
[75]  seem  to  indicate  that  most  users  have  difficulties  with 
complex  query  specifications.  Thus,  this  latter  tool  might  be 
quite  valuable. 

3.  DESIGN  OF  THE  SYSTEM 

3.  1  OVERVIEW 

3.1.1  Introduction 

Levels  1-3  that  are  outlined  in  figure  1  are  described 
in  figure  4  in  more  detail.  Each  level  is  supported  by  an 
independent  module  that  runs  on  the  virtual  machine  provided  by 
the  next  lower  level.  Each  table  is  maintained  by  a  single  level 
and  can  only  be  accessed  through  commands  provided  by  that  level. 
Thus,  there  are  no  global  data  structures. 

3.1.2  Level  3 

The  user  level,  level  3,  provides  three  types  of 

commands: 

1 .  queries , 

2.  insertions,  deletions,  and  updates,  and 

3.  ERA  commands. 

Queries  are  stated  in  the  user  version  of  LSL  described  in 
section  2.2.  Insertions,  deletions  and  updates  are  tuple-at-a- 
time  operations  on  the  relations  and  current  elements  that  are 
defined  at  this  user  interfatce.  Commands  can  be  used  by  the  DBA 
at  this  interface  to  create,  destroy,  or  modify  the  structure  of 
relations,  to  add  or  delete  domains  of  relations,  and  to  create 
or  destroy  inverted  files.  The  architecture  is  designed  to  allow 
more  complicated  access  structures  to  eventually  be  added  to  the 
system  [Schmid  and  Bernstein  75]. 

Queries  are  compiled  into  a  tree  that  represents  the  LSL 
query.  Then  the  tree  is  sent  to  level  2  to  be  interpreted.  Level 
2  returns  a  subset  element  (called  a  basic  element  in  our  system 
[Erodie  et  al  75])  that  points  to  the  subset  of  the  data  base 


Queries 


insertion , 
deletion,  & 
update 


DBA 


FIGURE  4  SYSTEM  STRUCTURE 
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described  by  the  query  tree.  The  update  and  DBA  commands  are 
serviced  mere  or  less  directly  by  level  2. 

3.1.3  Level  2 

The  access  structure  level,  level  2,  has  three  main 
functions : 

1.  interpet  LSL  queries, 

2.  maintain  the  consistency  of  existing  access  structures, 
and 


3.  create  and  destroy  access  structures. 

The  LSL  interpreter  is  the  major  component  of  this 
level.  The  overall  goal  of  the  interpreter  is  to  minimize  the 
number  of  disk  accesses  required  to  obtain  the  subset  of  the  data 
base  described  by  the  query.  In  interpreting  link  and  selector 
operations,  which  are  passed  in  the  form  of  a  tree,  it  therefore 
must  use  inverted  files  where  they  exist. 


19 


This  level  is  responsible  for  maintaining  the 
consistency  of  inverted  files  during  updates  to  the  relations. 
The  EEA  operations  to  create  and  destroy  inverted  files  are  also 
serviced  by  this  level. 

3.1.4  Level  1 

The  file  map  and  data  structure  software,  level  1, 
provides  two  classes  of  functions. 

The  file  map,  called  Mini-omega,  implements  virtual 
files  (called  record  types  in  our  system) .  Commands  on  record 
types  include  create  and  destroy,  update,  and  retrieve.  Mini¬ 
omega  maintains  tables  that  describe  the  structure  of  each  record 
type  and  a  table  describing  each  donain  in  the  record  types, 

Easic  commands  on  data  structures  are  supported  by  the 
data  structure  system.  Our  current  design  supports  only  E-trees, 
although  ether  structures  such  as  hash  tables  may  be  added  later. 
Access  structures  on  level  2  are  implemented  using  the  more 
primitive  data  structures  provided  by  this  level, 

Ecth  Mini-omega  and  the  data  structure  system  store 
their  data  directly  on  the  UNIX  file  system  [Eitchie  and  Thompson 
74]. 

3.2  QDEEY  LANGOAGE  TEANSIATCE 

The  main  function  of  the  query  language  translator  is  to 
translate  the  user  commands  and  query  specifications  into  the 
internal  form.  Each  command  is  translated  into  a  control  block, 
and  each  query  specification  is  translated  into  a  tree.  Trees 
were  chosen  as  the  internal  form  because  they  are  easy  to 
manipulate  and  modify  for  optimization  and  translation  purposes. 
The  structure  of  the  tree  will  be  described  in  section  3.3  as 
part  cf  the  interpreter  interface. 

Tc  facilitate  the  translation  of  the  user  language  into 
the  tree  form,  the  translator  maintains  two  tables.  One  table 
contains  information  about  the  names  of  relations  and  domains. 
User  names  for  relations  and  domains  are  encoded  as  integers  in 
the  query  tree;  hence  the  system  deals  internally  not  with  user 
defined  names,  but  with  simple  integers.  The  second  table 
centains  definitions  and  usage  statistics  of  defined  links, 
selectors  and  expressions.  The  tree  representing  each  defined 
link,  selector,  or  expression  is  stored  in  the  table  so  that  it 
can  be  directly  added  to  the  query  tree  if  encountered  during 
translation.  Commonly  used  queries  can  normally  be  defined  and 
therefore  will  be  translated  by  this  simple  table  look-up. 
Sometimes,  a  user  will  specify  a  query  that  is  functionally 
equivalent  to  some  defined  query.  Our  system,  however,  does  not 
try  to  test  equivalence,  mainly  because  the  translation  process 
is  reasonably  fast  anyway.  (Of  course,  in  the  general  case  the 
problem  is  undecidable.) 

Finally,  if  specified  by  the  EEA  the  translator  does 
data  compression.  This  compression  is  done  in  the  same  way  for 
input  of  tuples  and  for  the  predicates  in  the  query 
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specification,  and  the  corresponding  decompression  is  done  for 
the  output  of  data.  Consequently,  the  interpreter  and  the  lower 
levels  usually  need  not  know  whether  a  data  value  is  compressed 
or  not;  they  deal  with  the  data  as  it  comes  down  from  the 
translator . 

Different  compression  methods  can  be  used  for  different 
domains.  Which  one  is  used  is  defined  in  the  relation  and  domain 
name  table.  There  are  some  standard  compression  methods  that 
will  be  provided  by  the  system.  J^ll  variable  length  domains  are 
currently  compressed  into  a  fixed  length  form.  Additionally, 
particular  compression  methods  may  be  defined  by  writing  special 
compression  and  decompression  procedures. 

The  advantage  of  doing  compression  at  this  high  level  is 
that  we  do  compression  less  frequently  than  if  Mini-omega,  the 
storage  system,  were  to  do  it.  Consider  a  sequential  scan 
through  a  relation  for  a  tuple  with  a  certain  domain  value.  In 
our  system,  the  translator  does  the  compression  of  the  value  in 
the  predicate  once.  Afterwards,  compressed  values  are  compared 
on  eguality.  On  the  other  hand,  if  Mini-omega  were  to  perform 
the  compression,  then  the  field  value  of  each  record  would  have 
to  be  decompressed  by  Mini-omega  to  make  the  comparison. 

There  are  two  problems  worth  mentioning  regarding  the 
handling  of  compression  at  the  translator  level.  First,  the 
application  of  inequality  comparisons  in  predicates  influences 
either  the  predicate  interpretation  or  the  usage  of  access 
structures.  In  both  cases,  the  compression  techniques  must  be 
order  preserving,  so  that  the  query  can  be  interpreted  at  the 
storage  level  without  decompression.  But  this  restriction  is 
usually  quite  easy  to  fulfill.  For  example,  many  practical 
applications  store  a  number  or  a  character  in  one  byte.  In  this 
case,  there  are  simple  order  preserving  compression  techniques  by 
which  quite  a  bit  of  storage  space  is  saved. 

The  second  point  is  that  predicates  can  contain 
arithmetic  functions,  such  as  SAL  <  LAST_SAL  100  in  a  link.  In 
these  cases,  the  query  interpreter  must  use  decompressed  data, 
and  therefore  must  decompress  values  in  the  predicates  and  the 
retrieved  records.  In  this  case,  our  method  is  a  little  bit 
worse  than  the  conventional  technique.  However,  we  judge  that 
these  cases  occur  infrequently  in  practical  applications,  so  that 
we  get  a  considerable  overall  gain  by  our  method. 

3.3  INTEEPEETEB 

The  Interpreter  directs  the  actual  evaluation  of  the 
query.  It  services  commands  from  the  language  translator  and  uses 
the  facilities  provided  by  Mini-omega  and  the  data  structures 
system.  From  the  translator  it  accepts  the  query  specification 
in  the  form  of  a  tree.  By  utilizing  the  services  of  Mini-omega 
and  the  data  structure  system  it  returns  the  record-type  of  the 
result  of  the  query  and  the  index  of  a  basic  element  containing 
the  indices  of  the  records  that  satisfy  the  query. 

Each  node  of  the  query  tree  corresponds  to  a  record 
type,  a  selector,  a  link,  or  a  set  operation.  Eecord  type  nodes 
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are  always  leaves.  A  selector  has  one  subtree  that  defines  the 
subset  of  the  record  type  on  which  it  operates,  A  link  has  two 
subtrees,  representing  the  subset  of  its  source  record  type  and 
the  target  record  type.  The  set  operations  of  union, 
intersection,  and  difference  each  have  two  subtrees  that 
represent  sets  of  tuples  from  a  single  record  type. 

Associated  with  each  selector  and  link  node  is  a  boolean 
qualification.  This  qualification  is  stored  in  a  predicate  tree. 
Each  node  of  the  tree  is  either  a  constant,  a  domain,  an 
arithmetic  operator,  a  binary  arithmetic  relation,  or  a  boolean 
operator.  The  tree  is  structured  in  the  obvious  way,  with 
constants  and  domains  as  leaves,  and  operators  and  relations  as 
internal  nodes.  For  a  given  value  of  each  of  the  domains  in  the 
tree,  the  tree  evaluates  to  true  or  false. 

The  translator/interpreter  interface  only  involves 
passing  the  root  of  the  query  tree  to  be  solved,  and  returning 
the  result.  The  interpreter  examines  the  data  base  to  select 
those  tuples  that  satisfy  the  query.  The  selected  tuples  are  then 
returned  to  the  translator  in  a  basic  element. 

The  query  tree  is  interpreted  using  a  recursive  endorder 
traversal.  To  evaluate  a  node,  the  children  of  the  node  are 
evaluated  first,  yielding  subsets  of  one  or  two  record  types. 
The  operation  associated  with  the  node  is  then  applied  to  the 
subset  (s)  and  the  result  is  returned. 

The  interface  to  level  1  allows  the  interpreter  to: 

1.  open  and  close  files  and  inverted  files; 

2.  create  and  destroy  basic  elements; 

3.  take  the  set  unicn,  intersection,  and  difference  of  two 
basic  elements; 

4.  obtain  a  basic  element  containing  tuples  with  a  given 
domain  value,  if  that  domain  is  inverted;  and 

5.  initiate  and  execute  sequential  scans  (through  either  a 
file  or  a  subpart  of  a  file  as  specified  by  a  basic 
element)  to  obtain  domain  values. 

The  main  principle  behind  the  translator  is  to  use 
inverted  files  to  avoid,  as  much  as  possible,  going  into  the  data 
base  to  prove  whether  a  given  predicate  is  satisfied.  The 
interpreter  tries  to  work  with  sets  of  tuples  together  and  avoid 
dealing  with  only  one  tuple  at  a  time.  The  following  example  is 
presented  to  illustrate  the  interpreter's  operation. 

Consider  the  following  query: 

SELECT  ALL 

FECB  MANAGEE 

WHEEE  (((AGE=50)  OE  (DEPT= *  TO Y * )  ) 

AND  (SAL>  4000)  OE  (L0C= * TCEONTO) ) 
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Suppose  that  inverted  files  exist  on  the  SAL  and  LOC  domains  of 
MJ^NAGER  and  on  no  other  domains.  The  following  sequence  of 
commands  would  be  issued  by  our  interpreter  to  Mini-omega  and  the 
data  structure  system  to  service  this  query. 

1.  Obtain  an  empty  basic  element,  EE1. 

2.  Put  in  EE1  the  addresses  of  all  MANAGER  tuples  with  LOC  = 

•TORONTO',  using  the  inverted  file  on  the  domain  LOC. 

3.  Obtain  an  empty  basic  element,  EE2. 

4.  Put  in  BI2  the  addresses  of  all  MANAGER  tuples  with  SAL  > 

4000.  (This  is  easy  if  the  inverted  lists  are  sorted  by  SAL.) 

5.  Store  the  union  of  EE1  and  EE2  into  EE1. 

6.  Check  each  tuple  in  MANAGER  pointed  to  by  EE1  to  see  if  AGE  = 

50  or  DEPT  =  'TOY'.  If  so,  retain  it.  If  not,  delete  it. 

In  each  step  we  handle  sets  wherever  possible.  Sets  are  passed 
using  basic  elements.  To  obtain  a  particular  domain  value  of  a 
record,  we  pass  a  basic  element  that  has  just  a  single  entry. 

When  evaluating  a  query,  the  interpreter  does  not  care 
which  data  structure  is  used  to  implement  any  particular  inverted 
file.  That  is,  the  inverted  file  cmmands  are  uniform  across  the 
different  implementations  of  inverted  files.  In  the  case  that 
there  is  a  choice  among  several  access  paths,  where  the 
interpreter  could  use  different  inverted  files,  the  interpreter 
must  be  able  tc  look  up  in  a  table  which  inverted  files  are 
faster.  Apart  from  this  case,  the  interpreter  is  logically 
insulated  from  the  lower  levels  of  the  system. 

A  second  principle  is  that  we  collect  operations  that 
belong  logically  together  intc  a  group  of  commands  to  the  lower 
level.  A  group  of  commands  is  started  by  an  INIT  command.  This 
gives  a  basic  description  of  the  group  to  the  lower  level,  and  is 
valid  for  all  of  the  subsequent  commands  of  the  same  group.  This 
description  is  discarded  with  a  CLOSE  command.  Thus,  we  save 
passing  the  same  information  repeatedly  down  to  the  lower  level 
(for  each  element  of  the  group) .  Moreover,  the  lower  level  need 
net  process  this  information  on  each  operation,  since  it  knwos 
exactly  what  kind  cf  command  is  coming  next. 

Suppose,  for  example,  that  the  interpreter  needs  to  do  a 
sequential  scan  through  a  record  type  to  evaluate  a 
qualification.  Ey  an  INIT  command,  the  interpreter  tells  Mini- 
omega  which  fields  of  a  virtual  file  are  to  be  retrieved  into 
certain  buffer  spaces  provided  by  the  interpreter.  Mini-omega 
new  does  a  table  Icok-up  in  the  record  type  information  table  on 
disk  and  opens  the  files  corresponding  to  the  record  type.  Then 
it  locks  up  the  position  cf  the  fields  within  the  record  type  in 
the  domain  information  table  on  disk,  and  prepares  a  small  main 
memory  table  with  just  the  information  requested  for  the 
subsequent  group  of  operations.  Then  the  interpreter  issues 
GET_NEXT  commands  (that  contain  no  additional  information) . 
Mini-omega  executes  them  without  any  other  table  look-ups.  On  a 
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CLCSE  command,  the  files  are  closed  by  Mini-omega  and  the 
information  about  the  operation  is  discarded.  If  the  operations 
were  not  grouped  using  INIT  and  CLOSE  commands,  then  Mini-omega 
would  have  to  do  the  table  looJc-ups  on  disk  and  would  have  to 
open  and  close  the  files  with  every  GET_.NEXT  command. 

In  addition  to  processing  queries,  the  interpreter 
maintains  the  access  paths.  When  a  tuple  is  inserted,  deleted, 
or  modified  in  a  relation,  auxiliary  access  structures  (e.g., 
inverted  files)  on  the  corresponding  f ile  (s)  must  be  updated  to 
reflect  this  change.  The  interpreter  checks  which  structures  are 
affectd  by  the  update,  and  changes  these  structures 
appropriately. 

3.4  MINI-OMEGA 

The  purpose  of  Mini-omega  is  to  extend  the  UNIX  file 
system  to  i^upport  record  types  (i.e.,  virtual  files).  Fields  of 
a  record  of  a  record  type  are  accessed  by  specifying  logical 
field  and  record  number  and  a  space  unit  into  which  the  field 
values  will  be  put.  Mini-omega  services  several  commands  to  read 
and  write  records  on  the  record  types. 

The  UNIX  file  system  essentially  supports  a  basic  direct 
access  method  (EDAM) .  Once  a  UNIX  file  is  opened,  UNIX  will 
accept  a  word  number  as  a  parameter  and  transfer  a  block  of 
words,  which  begins  at  the  word  given  as  parameter,  into  or  from 
a  C  buffer  [Ritchie  and  Thompson  74]. 

Each  virtual  file  is  stored  as  one  or  more  record  types. 
Mini-omega  implements  record  types  with  records  of  fixed  length. 
Variable  length  fields  are  handled  by  transforming  them  into 
fixed  length  fields  using  a  compression  scheme.  Since  the  fields 
are  supposed  to  be  compressed,  field  boundaries  need  not 
correspond  with  byte  boundaries. 

Each  record  type  is  made  up  of  a  file  partitioned  into  a 
primary  area  that  is  sorted  by  the  key  and  an  overflew  area 
containing  records  that  have  been  added  since  the  last  file 
reerganizaiton .  This  allows  fast  sequential  processing  of  the 
file  with  only  two  buffers.  Shen  the  overflow  area  grows  too 
large,  then  a  reorganization  is  required  to  merge  it  with  the 
primary  area. 

Thus,  in  principle,  the  task  of  Mini-omega  is  quite 
simple.  When  it  gets  a  logical  field  and  record  number  from  the 
interpreter,  it  must  find  the  corresponding  UNIX  file  and 
calculate  the  physical  record  location  and  the  displacement  of 
the  field  in  the  record 

To  do  the  mapping  between  virtual  and  physical  files. 
Mini-omega  maintains  two  major  tables  that  are  on  disk.  The 
record  type  information  table  contains  one  entry  for  each  record 
type  (i.e.,  virtual  file)  known  to  the  system  and  describes  the 
physical  characteristics  of  the  storage  of  this  record  type.  It 
also  references  the  domain  information  table,  which  contains  one 
entry  describing  each  domain  in  each  record  type. 
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lABLE  1 

EECOED  TYPE  INFOEMATION  TABLE  AND  DOMAIN  INFCEMATICN  TAELE 

For  each  record  type,  the  record  type  information  table 

contains; 


1. 

the  internally  generated  UNIX  file  name  of  that  record 
type;  this  name  is  a  fully  qualified  path; 

2. 

the  length  of  a  record  in  bytes; 

3. 

the  total  number  of  records  in  the  record  type; 

4. 

the  total  number  of  sorted  records  in  the  record  type; 

5. 

an  index  into  the  domain  information  table,  which  gives 
the  domain  information; 

6  . 

the  number  of  domains  fcr  this  record  type; 

7. 

the  index  of  the  key  domain; 

8. 

the  index  of  the  sorted  domain. 

Fcr  each 
contains: 

domain  in  each  record  type,  the  domain  information  table 

t 

t 

1 . 

that  domain's  byte  offset  in  a  record; 

2. 

the  bit  offset  in  the  leftmost  containing  byte,  since 
packing  is  used; 

3. 

the  packed  width; 

4. 

the  unpacked  width; 

5. 

an  indication  of  whether  or  not  the  file  is  sorted  on 
this  domain; 

6. 

if  the  domain  is  inverted,  then  the  index  of  its 
inverted  file  is  placed  here; 

7. 

an  index  to  the  location  of  the  maximal  value  for  that 
domain ; 

8. 

an  index  to  the  location  of  the  minimal  value  for  that 
domain. 

The  position  of  an  entry  in  the  record  type  information 

tatle  corresponds  to  the  internal  record  type  name  (which  is  this 
number) .  The  same  holds  for  the  relative  position  of  a  domain  in 
the  domain  information  table  which  corresponds  to  the  internal 
domain  name. 
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The  access  to  the  record  type  and  domain  infcrmation 
tables  is  implemented  by  a  set  of  mcnitcr  procedures  that  form 
part  of  Mini-omega.  In  the  rare  cases  that  other  modules  need 
access  to  these  tables,  they  use  the  monitor  procedures.  The 
contents  of  these  tables  are  described  in  table  1. 

Mini-omega  supports  the  following  commands.  For  each 
command,  the  name(s)  in  parenthesis  is  the  actual  procedure  name. 
Each  is  described  in  more  detail  in  section  4.3.2. 


1.  open  a  record  type  (FBEP^FIIE) ; 

2.  close  a  record  type  (TEEMINATE^USE) ; 

3.  initial  a  domain  for  sequential  access 

(EOMAIN_ACCESS_INIT,  MULTI_DOMAIN_INIT) ; 

4.  access  a  record  value  (ACCESS_DOMAIN,  C01LECT_D0MAINS) ; 

5.  access  a  record  value  through  a  basic  element 
(EY_EE„ACCESS_DCMAIN,  EY_EE_COLLECT_DOMAINS) ; 

6.  close  a  domain  for  sequential  access 

(END_DCMAIN_ACCESS) ; 

7.  project  on  a  domain  (EEOJ_DOMAIN) ; 

8.  find  a  minimum  or  maximum  domain  value  (FIND_EXTEEMUM , 
MD1TI_VAIUE_FIND) ; 

9.  construct  an  inverted  file  (INVEET^DOMAIN) ; 

10.  monitor  routines  for  accessing  disk  resident  tables  (cf. 
section  4.3.3)  . 

3.5  DAIA  STRUCTUEES  SYSTEM 


The  data  structures  system  supports  commands  to  create, 
access,  modify,  and  destroy  basic  elements  and  inverted  files.  In 
our  current  implementation,  all  such  structures  use  E-trees 
[Eayer  and  McCreight  72]  .  Eayer  has  shown  that  E-trees  have  a 
good  overall  performance  for  a  wide  range  of  applications  of 
inverted  files.  In  no  case  are  they  very  bad.  Certainly,  for 
seme  special  applications  there  exist  better  techniques  (e.g,, 
hashing  in  not  very  dynamic  applications) .  While  limiting  the 
system  to  one  data  structure  leads  to  a  small  and  clean  system,  a 
more  versatile  system  would  probably  require  additional  data 
structures. 


A  basic  element  is  a  variable-sized  ordered  set  of 
logical  record  numbers.  In  addition,  there  may  be  information 
associated  with  every  record  number.  E-trees  are  well-suited  to 
implement  variable-sized  ordered  sets.  For  efficiency  reasons, 
the  usual  insertion,  deletion,  and  accessing  routines  for  E-trees 
must  be  extended.  When  a  sequence  of  records  is  inserted  into  a 
E-tree  in  sorted  order , significant  savings  over  random  insertions 
can  be  obtained  by  incorporating  this  knowledge  in  the  insertion 
routine.  A  similar  saving  can  be  obtained  for  a  sequential 


scan 
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over  randomly  accessing  of  a  E-tree.  Hence,  independent  routines 
for  sequential  and  random  insertion  and  sequential  and  random 
accessing  are  provided. 

For  these  reasons,  we  decided  to  use  E-trees  as  data 
structures  not  only  for  the  implementation  of  inverted  files,  but 
for  basic  elements  as  well. 

The  data  structure  system  supports  two  kinds  of  basic 
elements  (abbr.  EE),  both  of  which  are  sorted  by  increasing 
record  index: 

System  Basis  Element :  a  system  EE  is  used  to  store  something 
iSSEomir  such  as  an  intermediate  result  of  the  evaluation  of  a 
query.  It  is  expected  to  be  destroyed  scon  after  it  is  created. 
Since  the  EE  has  a  short  life  span,  header  information  for  a 
system  EE  is  kept  simple  and  is  stored  in  main  memory.  This 
allows  the  CEEi^.TE  and  DESTROY  operations  to  be  simple  and  fast. 
Since  the  EE  could  be  quite  large,  the  body  of  the  B-tree  is 
stored  on  disk,  when  it  is  larger  than  one  page.  All  EEs  of  this 
type  are  stored  on  a  single  UNIX  file. 

ii.  Permanent  Basic  Element :  a  permanent  EE  is  one  that  will  be 
kept  for  a  long  period  of  time.  That  is,  it  is  expected  to  be 
retained  after  the  current  query  is  processed.  Header  information 
for  a  permanent  basic  element  is  kept  on  disk  in  a  table  of 
headers  for  all  E-trees  except  system  EEs. 

In  addition  to  EEs,  five  kinds  of  inverted  files  are 
supported,  corresponding  to  the  five  different  types  of  domains 
on  which  an  inverted  file  can  be  constructed.  The  five  types  of 
domains  are: 


i . 

key  domain. 

ii . 

key-like  domain. 

iii. 

domain  with  very  few 

values. 

iv. 

domain  with  several 

values  (between 

(i) 

and  (iii) )  , 

V. 

domain  with  very 
involving  ranges. 

long  keys  or 

non- 

discrete  values 

The  choice  of  which  kind  of  inverted  file  should  be  used  is 
currently  left  up  to  the  EEA. 

E-trees  have  nodes  that  are  pages  of  linear  storage. 
Within  a  page,  all  properties  of  linear  storage  are  preserved 
(e.g.,  sequential  or  binary  search,  shifting).  Passage  from  one 
page  tc  another  is  done  by  following  a  pointer.  When  a  page  gets 
filled  up,  it  is  split  into  two  pages  followed  by  a  readjustment 
of  pointers.  Similarly,  two  adjacent  pages  that  are  too  empty 
are  joined  into  one  page. 

In  our  system,  we  use  E-trees  where  only  the  leaf  pages 
contain  real  information.  Non-leaf  pages  contain  only  pointers 
tc  other  pages.  Thus,  sequential  processing  of  information 
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involves  a  scan  through  the  leaf  pages  in  a  left  order  tree 

traversal.  Other  advantages  (and  disadvantages)  of  this  approach 
in  comparison  with  the  approach  that  all  pages  contain  real 

information  (as  originally  proposed  in  [ Eayer  and  McCreight  72]) 
are  discussed  in  [ Knuth  74],  [Wedekind  74], 

An  entry  in  a  leaf  page  consists  of  a  key  field  and 
information  field.  When  a  E-tree  is  used  to  implement  BEs,  then 

the  key  field  contains  a  record  index  and  the  information  field 

is  usually  compressed  to  zero  length.  When  a  E-tree  is  used  to 
implement  an  inverted  file,  the  key  field  contains  the  domain 
value,  and  the  information  field  the  corresponding  record  index. 

To  save  space,  a  domain  value  is  not  necessarily 
associated  with  each  record  index  for  all  the  different 
distributions  of  domain  values  mentioned  above.  Either  a  fixed 
or  a  variable  number  of  record  numbers  can  be  associated  with  one 
domain  value.  For  some  types  of  keys,  an  interval  of  domain 
values  appears  instead  of  single  domain  value.  For  each  of  these 
cases  the  algorithms  differ  only  in  the  treatment  of  leaf  pages. 

Optimal  page  sizes  are  about  200-500  entries  when 
movable  head  disks  are  used  for  B-tree  storage.  In  our  system, 
we  have  chosen  the  page  size  equal  to  the  UNIX  buffer  size  of  256 
16-bit  words,  which  is  not  too  far  from  the  optimum.  For  files 
cf  up  to  10,000  to  100,000  records,  a  B-tree  will  then  have  about 
2-3  levels  (including  the  root  page) .  During  the  time  that  an 
inverted  file  or  BE  is  actively  used,  the  root  page  is  usually 
kept  in  core.  Then,  1-2  disk  accesses  will  be  required  to  find  a 
particular  key  value. 

The  data  structure  module  supports  the  following 
commands:  For  each  ccmmand,  the  name  in  parenthesis  is  the 

actual  procedure  name.  Each  is  described  in  more  detail  in 
section  4.4. 

1 .  Create  a  E-tree  (CEEATE_E_TEEE) ; 

2.  Destroy  a  E-tree  (DESTHOYE_TEEE) ; 

3.  Prepare  a  E-tree  for  sequential  insertions 

(EE_INSEPT_SEQ_INIT) ; 

4.  Insert  a  key  into  a  E-tree  (EE_KE_INSEET) ; 

5.  Close  a  E-tree  for  sequential  insertions  (END_B_I_S) . 

In  addition  to  the  above  commands,  the  data  structu’’*^ 
module  will  have  to  support: 

6.  random  updates  on  E-trees,  primarily  to  allow 

maintenance  of  inverted  files; 

7.  sequential  accessing  of  key  values  of  a  E-tree. 

Hcwever,  (6)  and  (7)  have  not  yet  been  implemented. 
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Tc  support  these  commands,  the  data  structure  module 
uses  three  tables:  the  B-tree  table,  the  inverted  file  table, 
and  the  system  EE  table. 

The  E-tree  table  is  disk  resident  and  contains 
information  about  all  E-trees  except  system  EEs  (see  table  2) . 
When  a  E-tree  other  than  a  system  EE  is  created,  header 
information  about  the  E-tree  is  stored  in  an  entry  in  the  E-tree 
table.  The  index  of  the  E-tree  header  in  the  table  becomes  the 
name  of  the  E-tree.  The  system  BE  table  is  a  small  main  memory 
table  of  similar  structure  to  the  B-tree  table. 

The  inverted  file  table  contains  information  about  all 
UKIX  files  associated  with  inverted  files.  One  UNIX  file  is  used 
tc  store  all  inverted  files  associated  with  cne  record  type. 
Therefore,  there  is  an  entry  in  the  inverted  file  table  for  each 
record  type.  The  entry  includes  the  UNIX  file  name,  the  number  of 
inverted  files  on  the  record  type  and  storage  administration 
information  to  keep  track  of  free  pages  in  the  UNIX  file.  When 
an  inverted  file  is  created,  the  entry  corresponding  to  the 
record  type  number  is  accessed.  If  the  entry  exists,  the  same 
file  is  used.  Otherwise,  a  new  file  is  created. 

3.6  UNIX  FILE  CFGi^NIZ^^.TION 

There  are  two  main  points  to  consider  when  assigning 
UNIX  files  tc  stcre  record  types.  E-trees,  and  tables.  First, 
the  UNIX  system  allows  files  to  grow  and  shrink  dynamically. 
This  flexibility  should  be  exploited  by  allowing  UNIX  to  do  as 
much  of  the  storage  administration  as  possible.  Second,  no  more 
than  about  12  files  can  be  open  simultaneously.  Consequently, 
information  must  he  be  distributed  cn  files  sc  that  very  few  UNIX 
files  need  to  be  open  for  one  operation  of  the  interpreter  (e.g., 
a  restriction,  a  link) .  Also,  they  should  be  distributed  in  a  way 
that  avoids  unnecessary  opening  and  closing  of  files  during  short 
intervals  of  time.  This  leads  to  the  following  organization. 

One  UNIX  file  is  used  tc  store  cne  record  type  (except 
the  case  cf  transposed  files,  where  several  UNIX  files  are  used). 

All  E-trees  representing  EEs  are  stored  on  cne  UNIX 

file. 

All  E-trees  representing  inverted  files  on  one  record 
type  are  also  stored  on  a  single  UNIX  file.  Different  UNIX  files 
are  used  for  E-trees  representing  inverted  files  cn  different 
record  types. 

One  UNIX  file  is  used  for  tables  containing  system 
information,  except  for  the  table  of  system  EE  headers  which  is 
kept  in  main  memory. 
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TABLE  2 

E-TREE  TAELE  AND  SYSTEM  BE  TABLE 


E-treg  Tatlj 

Each  entry  in  the  B-tree  table  contains  one  E-tree 
header,  including: 

integer  value  which  represents  the  type  of  the  B- 
tree:  permanent  EE  or  one  of  five  types  of  inverted  file 
(headers  for  system  BEs  are  in  main  memory) . 

2»  Filling  Factor;  the  percentage  of  the  node  that  will  be 
initially  filled.  In  inverted  files  and  permanent  BEs  there  will 
probably  be  updates  after  the  initial  filling  of  the  E-tree. 
Therefore,  it  is  desirable  to  leave  pages  partially  empty. 
System  EEs  are  temporary  and  will  not  be  updated,  so  the  filling 
factor  can  be  1009?. 

ZiiS  Name;  the  name  of  the  UNIX  file  on  which  the  B-tree  is 
stored. 

Scot  Address;  the  file  address  of  the  page  that  is  the  root 
node  of  the  E-tree. 

KJY  Length;  the  key  length  in  bytes 
6.  Number  of  Keys :  the  number  of  keys  in  the  whole  E-tree 

Lglormatign  Field  Length;  the  length  of  the  information  field 
associated  with  each  key  in  an  inverted  file.  For  key  domains  and 
key-like  domains,  the  length  is  two  bytes  (i.e.,  one  record 
index) .  For  domains  with  several  values  and  fixed  length 
information  fields,  the  length  will  vary  for  different  cases. 
For  domains  with  few  values  and  a  variable  length  information 
field,  an  indioator  of  -1  is  stored. 

When  a  node  becomes  full  and  there  are  still 
insertions  into  that  node,  two  things  can  happen:  either  the 

node  can  be  split,  or  the  node  can  do  an  overflow  with  one  or 
both  of  its  brothers.  The  overflow  factor  determines  which 
course  to  take.  If  the  fullness  of  the  brother  of  the  full  node 
exceeds  the  overflow  factor,  then  the  node  is  split.  Otherwise, 
an  overflow  is  done. 

Onderf low  Factor:  This  is  the  reverse  decision  of  overflow.  If 
the  fullness  of  the  brother  of  the  near-empty  node  is  less  than 
the  underflow  factor,  then  the  near-empty  node  and  the  brother 
are  concatenated. 

1C.  Record  In  E-trees  representing  inverted  files,  the 

record  type  number  is  stored. 
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t;^ELE  2  (cont'd) 

E-TREE  TAELE  AND  SYSTEM  EE  TAELE 


Table 

Each  E-tree  for  a  system  EE  has  cue  entry  in  the  system 

EE  table,  including: 

1.  Root  address 

2.  Key  Length 

3.  Number  of  Keys 

These  have  the  same  interpetaticn  as  for  the  E-tree  table.  The 
other  E-tree  table  values  are  not  applicable. 
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4.  DETAILED  DESCRIPTION 

In  this  section,  we  present  a  detailed  description  of 
the  mere  important  modules  and  procedures  in  our  system.  Note 
that  the  description  is  not  complete,  but  rather  highlights  some 
interesting  points. 

4.1  TE^.NSLATOE 


The  translator,  which  does 
generation  of  the  tree  form  of  the  guery, 
using  well-known  syntax-directed  compiler 
it  is  not  of  particular  interest,  and  we 
here. 


syntax  checking  and 
has  been  implemented 
techniques.  Therefore, 
will  net  describe  it 


4.2  INTEEPEETEB 

From  the  translator  the  interpreter  accepts  a  query  in 
the  ferm  cf  a  tree.  The  entries  cf  the  nodes  of  the  tree  must 
all  reside  in  an  integer  array.  Non-integer  constants  are  stored 
in  separate  arrays,  pointed  to  be  indices  stored  in  the  integer 
tree  array.  Character  constants  are  found  in  a  separate 
character  array.  In  the  future,  separate  arrays  will  be  needed 
tc  store  double  and  float  constants. 


A  query  tree  is  evaluated  in  two  passes  by  a  call  to  a 


first  pass,  driven  by  the  routine 
of  the  predicate  trees.  The  marks 
trees  require  a  sequential  pass 
of  the  record  type  on  which  the 
defined.  The  second  pass,  driven 
procedure  QOEBY_TREE_TEA.VEESAL,  evaluates  the  query  tree 
the  actual  data  in  the  data  base. 


driver  routine,  CALL_EABE.  The 
PBEPASS_TBEE,  marks  the  nodes 
indicate  which  portions  of  the 
over  the  individual  records 
associated  link  or  selector  is 
by  the 
based  on 
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4.2.1  First  Pass:  Marking  Predicate  Trees 


PEEPASS_TEEE:  This  procedure  traverses  the  query  tree  by  a 
recursive  endcrder  traversal.  At  each  node  of  the  query  tree, 
the  procedure  returns  the  name  of  the  record  type  that  results 
from  the  evaluation  of  the  subtree  rooted  at  that  node.  If  the 
node  is  a  link  or  restriction  operation,  then  the  associated 
predicate  tree  must  be  marked.  The  markings  are  used  by  the 
second  pass  to  avoid  sequential  processing  wherever  possible. 
Each  node  is  marked  in  its  parent  node  to  indicate  whether  it 
requires  a  sequential  scan  of  a  record  type.  Marking  is 
accomplished  by  a  call  to  MAEK_SQ_NON_SQ. 

MAEK_SQ_NCK_SQ:  This  procedure  recursively  traverses  the  boolean 
portions  of  a  prediate  tree.  A  node  that  roots  a  subtree  that 
represents  an  atomic  clause  of  the  predicate  is  processed 
separately  by  a  call  to  the  procedure  MAEK. 

If  the  predicate  tree  being  examined  is  associated  with 
a  link,  (say)  from  record  type  x  to  record  type  y,  then  we  avoid 
sequential  processing  of  y  only  if  all  domains  of  y  that  appear 
in  the  predicate  tree  are  inverted.  (However,  more  sophisticated 
algorithms  could  be  designed.)  Therefore,  each  boolean  node  is 
marked  as  requiring  sequential  processing  if  any  domain  in  either 
of  its  subtrees  is  not  inverted. 

If  the  predicate  tree  being  examined  is  associated  with 
a  selector,  then  the  marking  is  somewhat  more  complex.  We  first 
adopt  the  convention  that  if  one  subtree  of  a  node  requires 
sequential  processing  and  the  other  requires  non-se quential 
processing,  then  the  ncn-sequential  subtree  is  made  the  left 
subtree  of  the  node.  Since  the  predicate  tree  is  evaluated  left- 
tc-right  in  the  second  pass,  this  convention  guarantees  that  non¬ 
sequential  processing  is  accomplished  before  sequential 
processing  wherever  possible. 

Now,  in  a  selector  predicate  tree,  if  both  children 
require  sequential  processing,  then  the  parent  node  is  marked  as 
being  sequential.  Similarly,  if  both  children  are  non-sequential, 
then  the  parent  is  non-sequential.  If  one  child  is  non-sequential 
and  the  other  sequential,  then  the  two  subtrees  are  switched,  if 
necessary,  to  satisfy  the  above  left-right  convention.  In  this 
case,  if  the  parent  node  is  an  OE,  then  we  are  unable  to  make 
effective  use  of  the  fact  that  the  left  subtree  need  not  be 
processed  sequentially.  So,  the  node  is  marked  as  requiring 
sequential  processing.  Eut,  if  the  parent  node  is  an  AND,  then  we 
can  avoid  testing  the  predicate  tree  on  every  record  of  the 
record  type.  This  is  done  by  first  evaluating  the  left  (non¬ 
sequential)  subtree  on  the  whole  record  type  and  then  testing  the 
right  (sequential)  subtree  on  only  these  records  that  satisfy  the 
left  subtree.  Such  an  AND  node  is  marked  (in  its  parent)  as 
requiring  non-sequential  processing,  since  only  a  subset  of  the 
records  of  the  record  type  need  be  accessed. 

Finally,  MAEK_SQ_NON_SC  reaches  an  arithmetic  relation 
node  (i.e.,  <,  <,  >,  >,  =,  ^)  which  roots  a  tree  representing  an 
atomic  clause  of  the  predicate.  If  the  predicate  tree  is 
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associated  with  a  selector  and  the  atomic  claus 
arithmetic  operations,  then  inverted  domains  will  not 
sequertial  processing.  So,  the  node  may  he  marked  as 
sequential  processing  (without  looking  into  the  clau 
Otherwise,  MARK  is  called  to  traverse  the  predicate  on 
and  right  in  the  case  of  a  selector,  only  on  the  r 
case  of  a  link,  since  by  convention  the  domains  of 
record-type  of  a  link  must  appear  only  on  the  right 
arithmetic  relation  of  a  predicate. 


e  requires 
help  avoid 
requiring 
se  itself)  . 

the  left 
ight  in  the 
the  target 
side  of  an 


MASK:  MABK  simply  examines  the  domain 
or  not  all  the  domains  involved  in  a 
MABK  returns  sequential  if  any  domain 
it  returns  non-sequential . 


table  to  determine  whether 
predicate  are  inverted, 
is  not  inverted;  otherwise 


4.2.2  Second  Pass:  Evaluating  the  Query  Tree 


QUEBY_TBEE_TBAVEBSAL:  This  procedure  recursively  traverses  the 
query  tree  to  evaluate  the  query  based  on  the  contents  of  the 
data  base.  It  identifies  the  type  of  operation  for  eaoh  node  (set 
operator,  selector  or  link)  and  passes  control  to  the  appropriate 
rcufine  to  process  the  node.  It  returns  a  EE  representing  the 
subset  of  the  data  base  described  by  the  query  tree.  We  will 
discuss  set  operations  and  selectors.  We  will  not  discuss  link 
evaluation,  since  it  is  quite  similar  to  selectors  and  has  not 
yet  been  fully  implemented. 


Set  operations  are  performed  by  the  routines  SETTOP  (for 
union  and  intersection)  and  DIFFERENCE  (for  set  difference) . 
Each  child  of  the  node  is  evaluated  by  a  recursive  call  to 
QDEBY_TEEE_TEAVEBSAL.  The  resultant  basic  elements  are  then 
unioned,  intersected,  or  differenced  and  the  basic  element 
containing  the  result  is  returned. 


Evaluation  of  a  selector  proceeds  in  two  steps.  In  the 
first  step,  the  record  descriptor  of  the  node  is  checked  to 
determine  whether  the  selector  is  applied  to  an  entire  record 
type  or  to  a  subset  of  a  record  type  represented  by  a  subguery. 
If  the  selector  is  applied  to  a  subquery,  then  QDEEY_TEEE_TRA- 
VEESAl  is  called  (recursively)  to  traverse  the  query  tree 
representing  the  subquery;  this  returns  a  basic  element 
containing  the  record  indices  of  the  records  satisfying  the 
subquery.  If  the  selector  is  applied  to  the  whcle  record  type, 
then  the  selector  is  evaluated  directly.  The  second  step  is  to 
call  FEEEICATE_TBEE_TBAVERSAL  to  evaluate  the  predicate  tree  on 
the  record  type. 


FEEDICATE_TREE_TEAVEBSAL:  This  procedure  first  checks  its 
calling  parameters  to  determine  whether  a  EE  or  a  whole  record 
type  is  to  be  considered.  It  then  checks  whether  the  predicate 
has  been  marked  as  requiring  sequential  or  non-sequential 
processing.  If  non-sequential  processing  is  possible, 
NCN_SEQ__EVAl_BESTBICTION  is  invoked  to  evaluate  the  predicate. 
This  routine  basically  executes  set  operations  on  inverted  lists, 
and  is  not  discussed  here.  Otherwise,  a  call  is  made  to  EVAL_ALL 
(if  the  whcle  record  type  is  to  be  considered)  or  EVA1_A1L_BY_EE 
(if  a  EE  is  to  be  considered) . 
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BVP.L^P.ll  and  EV?.L_ALL_EY_EE:  These  routines  are  used  to  evaluate 
a  predicate  over  a  set  of  records.  First,  all  the  domains  that 
appear  in  the  predicate  are  entered  into  the  domain  information 
tatle  by  the  procedure  SET_DOMAINS.  This  prepares  EVAL^ALL  to 
communicate  with  Mini-omega.  The  procedure  0TILITIES2  is  then 
invoked  to  actually  evaluate  the  predicate  tree. 

UIILITIES2:  This  routine  is  responsible  for  evaluating  a 

predicate  tree  on  individual  records  of  a  record  type  (see  figure 
6).  First,  one  of  the  following  four  procedures  is  called: 

i.  CHECK:  if  the  predicate  is  on  one  domain  and  the  tree  is 
evaluated  on  all  records  of  the  record  type; 

ii.  MDLTI^CHECK:  if  the  predicate  is  over  multiple  domains 
and  the  tree  is  evaluated  on  all  records  of  the  record 
type; 

iii.  VIA_EE_CHECK:  same  as  CHECK,  except  over  a  subset  of  the 
record  type  specified  by  a  BE; 

iv.  MDLTI_EY_EE_CHECK:  same  as  MDLTI_CHECK  except  over  a 
subset  of  the  record  type. 

The  called  procedure  begins  by  instructing  Mini-omega  to  prepare 
the  required  domain (s).  A  pointer  to  the  buffer  in  which  the 
values  of  the  domains  are  to  be  stored  is  returned  and  the  widths 
of  the  domains  and  their  offsets  in  the  buffer  are  entered  into 
the  domain  table.  The  procedure  INIT_P_TEEE  (which  in  turn  calls 
INIT^EHED)  is  called  to  prepare  the  predicate  tree  for  evaluating 
the  predicate  on  individual  records. 

Next  an  empty  system  EE  is  acquired  by  a  call  to 
CEEATE_E_TPEE  on  the  E-tree  level  of  the  system.  This  EE  is  used 
tc  store  the  record  indices  cf  the  result  of  the  evaluation. 
This  EE  is  then  prepared  for  the  sequential  insertion  of  the 
indices  of  the  records  that  satisfy  the  predicate.  Then  each  of 
the  records  with  which  we  are  concerned  is  successively  accessed 
and  the  values  cf  the  required  domains  are  stored  in  the 
previously  specified  buffer,  each  domain  starting  at  the 
appropriate  offset.  A  call  to  0TILITIES3  determines  whether  or 
net  the  record  satisfies  the  predicate.  This  in  turn  dictates 
whether  or  not  the  record  index  is  inserted  into  the  EE. 

INIT_f_TEEE:  This  routine  recursively  tranverses  the  boolean 

nodes  cf  the  predicate  tree.  At  arithmetic  relation  nodes  (which 
root  atomic  clauses)  control  is  passed  to  INIT_FEED,  so  that 
domain  and  operator  nodes  in  each  clause  can  be  initialized  in 
preparation  of  evaluation.  For  a  predicate  of  a  selector,  both 
the  left  and  right  subtrees  are  initialized.  For  a  link,  since 
the  domains  of  the  target  record-type  appear  only  on  the  right, 
only  the  right  subtree  need  be  initialized.  (The  left  side  (the 
•source*  record-type  side)  is  initialized  in  a  separate  pass.) 
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INIT_P5ED:  In  each  domain  node,  the  width  of  the  domain  and  a 
pointer  to  the  buffer  position  where  its  value  will  be  returned 
(by  Mini-omega)  is  set.  If  cue  of  these  domains  is  compressed  and 
an  arithmetic  operation  will  be  done  on  this  domain,  then  a 
buffer  is  allocated  for  the  decompressed  value.  A  pointer  to  this 
buffer  and  its  width  are  also  set.  In  arithmetic  operator  nodes, 
a  double  length  buffer  is  allocated  for  the  intermediate  result 
and  a  pointer  to  it  is  set.  Note  that  arithmetic  operators 
involve  only  domains  containing  numeric  rather  than  character 
values . 

UTILITIES  3:  After  all  of  the  above  initialization  has  been 
accomplished,  this  procedure  finally  evaluates  the  predicate 
against  a  record  of  the  reccrd-type  (see  figure  7) .  The  values 
of  the  various  domains  of  the  record  with  which  the  procedure  is 
concerned  must  reside  in  the  buffers  for  which  the  predicate  tree 
has  been  initialized  (UTILITIES  2) .  The  main  evaluation  procedure 
is  EVA I. 

EVAL:  EVAL  recursively  evaluates  the  subtrees  rooted  at  boolean 
nodes  cf  the  predicate  tree.  Cnee  a  subtree  that  represents  an 
atcmic  clause  of  the  predicate  is  reached,  control  is  passed  to 
either  CCMEIEX_PEED_EVAL  or  SIMPLE_PRED_EVAI  depending  upon 
whether  cr  not  the  atomic  clause  involves  arithmetic  operations. 
The  selected  routine  returns  whether  or  not  the  clause  is 
satisfied  by  the  record. 

SIMPL I_PEEE_EVAL:  This  function  evaluates  a  simple  atomic 
clause.  It  determines  whether  the  values  involved  are  numeric  or 
character  and  then  calls  either  NUMEEIC_COMPAEE  or 
CHARA.CTER_CCMEAEE  to  determine  whether  or  not  the  clause  is 
satisfied.  FETCH_TYPE  is  used  to  supply  the  compare  routine  with 
the  types  of  and  pointers  to  the  values  to  be  compared. 

CCMPLEX_PRED_EVAL :  This  routine  performs  the  required  operations 
by  means  of  EVALUATE_ARITH  and  finally  calls  NUMERIC_COMPARE  to 
determine  whether  the  predicate  is  satisfied.  As  in 
SIMPLE_PPED_EVAL,  it  calls  FETCH_TYPE  to  supply  NUMEFIC_COM PARE 
with  the  types  of  and  pointers  to  the  two  values  to  be  compared. 

FETCH^TYPE:  This  function  may  be  used  only  on  OPERATOR, 
CONSTANT,  and  DOMAIN  nodes.  It  returns  the  length  of  the  value 
associated  with  the  node  and  sets  a  pointer  to  the  value. 

CHA.RACTER_COMPAEE :  This  routine  returns  the  truth  value  obtained 
when  the  character  strings  passed  are  compared  by  the  operator 
passed . 

NUMERIC^  (CCMPAEE/EVAL)  :  This  is  a  dual  purpose  routine,  the  mode 
of  which  is  determined  by  the  operator  passed.  If  the  operator  is 
of  the  comparison  type  (>,  >,  <,  <,  =,  then  the  function 
serves  to  compare  the  two  values  passed  (by  pointer)  and  returns 
the  truth  value  of  this  comparison.  If  the  operator  is  of 
arithmetic  type  (+,-,/, x)  then  the  specified  evaluation  takes 
place  and  the  (double)  result  is  stored  in  the  (double)  buffer 
indicated  by  the  pointer  passed.  In  either  case  the  values 
involved  are  converted  to  double  so  that  values  are  always  of  one 
type. 
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EVi^.LO^TE^AEITH :  This  routine  recursively  traverses  through  a 

subtree  that  represents  an  atomic  clause  that  requires  arithmetic 
evaluations.  Ey  calling  FETCH_TYPE  and  then  NOMEBIC_EVAL ,  it 
stores  the  buffer  pointed  to  by  each  numeric  operator  node  the 
arithmetic  value  (for  the  record  being  processed)  of  the  subtree 
of  which  the  operator  node  is  the  root. 

4.3  MINI-CKEGA 

The  function  of  Mini-omega  is  to  support  file  operations 
on  virtual  files  for  the  interpreter.  Mini-omega  and  the 
interpreter  generally  communicate  by  passing  data  to  be  read  or 
written  through  buffers.  An  interface  also  exists  with  the  E- 
tree  level;  since  Mini-omega  handles  all  file  accessing,  it  must 
be  used  to  service  even  simple  EE  commands. 

Information  from  the  tables  is  brought  into  main  memory 
as  it  becomes  pertinent  to  the  processing  of  commands.  The  first 
five  fields  of  the  record  type  information  table  are  read  and 
stored  in  the  main  memory  FILE^INEO  table  when  initiating  access 
to  the  appropriate  record  type,  while  the  remaining  three  are 
extracted  as  needed.  The  first  six  fields  of  an  entry  in  the 
domain  information  table  are  read  and  stored  in  the  main  memory 
DCM_INFO  table  when  initiation  of  domain  accessing  takes  place, 
while  the  remaining  two  are  used  when  needed.  The  required 
information  is  brought  from  SYSTEM_INFO  into  main  memory  tables, 
which  also  contain  additional  information  required  during 
processing.  The  SYSTEM^INFO  file  is  opened  only  once  at  the 
start  of  a  session  with  a  certain  data  base,  and  closed  when  the 
session  is  finished. 

Enough  space  is  allocated  in  main  memory  to  hold 
complete  information  tables  for  one  file  and  all  of  its  domains. 
As  additional  files  are  opened,  a  stacking  mechanism  is  used  to 
allocate  more  storage  dynamically.  A  master,  core-resident  file 
directory  stores  pointers  to  all  of  the  tables  information  for 
each  record  type, 

4,3.1  Tables 

The  following  three  tables  are  maintained  by  Mini-omega. 

i.  FILE_riEECTOEY:  This  table  contains  three  entries  for  each 
record  type.  Most  of  the  procedures  in  Mini-omega  start  off  by 
obtaining  necessary  file  information  from  this  directory. 

PTB_FILE_INFO_TABLE:  the  address  of  the  file  information 

table  for  the  given  record  type; 

PTE_DCM_INFO_TABLE:  the  address  of  the  domain  information 

table  for  the  given  record  type; 

IFILEEDM:  the  interpreter  level's  file  number  for  a  record 

type. 

ii.  ECM^INFO:  Each  entry  of  this  table  holds  information  about 
one  domain.  Entries  are  passed  by  pointers. 
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DOHDES:  a  zero-origin  index  indicating  the  location  of  domain 
^lithin  the  record  type; 

BIT_WIETH:  the  packed  width; 

EYIE_WIDTH;  the  unpack  width; 

SCETEE_OE_NOI ;  whether  or  not  the  file  is  sorted  on  this 
domain ; 

INVEETEE_OE_NOT:  whether  or  not  an  invertedfile  exists  on 

this  domain; 

COMEINEB_EYTE_WIDTH;  size  of  buffer; 

EDrE_CFFSET:  location  of  the  domain  in  buffer; 

DCBTYFE:  One  of  the  four  allowable  types  types; 

TJ^G:  when  finding  an  extremum,  this  indicates  whether  a 

maximum  or  minimum  is  sought, 

iii.  FILE_INFO:  Each  entry  in  this  table  describes  the  UNIX  file 
associated  with  a  record  type. 

UNIX_FILEN^ME:  the  internal  name  corresponding  to  the 

interpreter's  number  for  the  record  type; 

EYTE^EECLEN:  the  length  of  a  record  in  bytes; 

NUMEECS:  the  total  number  of  records; 

NDM_SOFTED_EECOEDS :  the  number  of  sorted  records  in  the  file; 
the  last  (NUMEECS  -  NUM_SOETED_ EECCEES)  of  the  file 
constitute  the  overflow  area; 

ECMINEEX:  the  address  in  SYSTEM_INFO  of  the  domain 

information; 

EEC_CCUNTEE:  a  record  counter  for  the  given  file,  used  in 

sequential  (record)  retrieval; 

FIIESEES:  the  system's  i-number; 

EAW_EEAE_EUFB ;  a  pointer  to  a  buffer  area  into  which  packed 
domain  values  are  read; 

EUFE_FTE:  a  pointer  to  a  buffer  area  in  which  domain  values 

are  unpacked  (this  is  generally  passed  to  the 
interpreter  level  as  the  result  of  a  command) . 


4.3.2  Interpreter  Interface 

The  following  routines  are  used  by  the  interpreter  level 
to  evaluate  queries. 
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PEIP_FIIE:  This  routine  takes  a  record  type  name  as  input.  It 
brings  the  domain  and  file  information  table  data  from  sys_INFO 
into  main  memory.  If  the  upper  bound  on  the  number  of  open  files 
has  not  yet  been  reached,  then  the  record  type  is  opened  and  a 
pointer  to  the  domain  information  table  is  returned. 


I1EMINATE_DSE: 
The  associated 
ONIX  file  is 
are  freed. 


This  routine  terminates  activity  on  a  record  type, 
file  directory  entry  is  removed,  the  associated 
closed  and  all  domain  and  file  information  tables 


PECJ_ECBi^IN:  This  routine  takes  a  record  type  or  a  subset  of  a 
record  type  and  a  domain  name  as  input.  It  produces  a  BE 
containing  the  projection  of  the  (subset  of  the)  record  type  on 
tte  given  domain. 

First,  the  B-tree  level  is  called  to  initialize  a  new  BE 
to  hold  the  projection.  Next,  if  the  domain  has  an  inverted  file 
of  type  i-iv  (cf .  section  3.5),  then  the  B-tree  level  can  simply 
extract  the  values.  Otherwise,  the  records  in  the  record  type 
must  be  fetched,  one  at  a  time,  and  inserted  into  the  E-tree. 
The  E-tree  level  elminiates  duplicate  values. 

DCMAIN_ACCESS_INIT:  This  routine  prepares  Mini-omega  to  accept  a 
seguence  cf  read  requests  for  a  single  domain.  It  first  obtains 
the  information  it  needs  regarding  the  domain  from  the  domain 
table.  Then  the  relevant  read  pointer  is  fixed  at  the  first 
record  of  the  record  type.  Finally,  two  buffers  are  allocated 
(all  buffers  used  by  Mini-omega  are  allocated  on  one  stack) .  One 
buffer  holds  the  record  as  it  comes  in  from  the  disk;  the  second 
holds  the  expanded  domain  value,  properly  aligned.  Mini-omega 
can  now  receive  ACCESS_DOMi^.IN  cmmands  on  this  record  type  domain. 


ACCESS_DOMAIN :  This  function  is  used  (by  the  interpreter  level) 
to  sequentially  extract  a  particular  domain  value  from  any  given 
record  type.  DOMAIN_ACCESS_INIT  must  have  been  previously  called. 
Since  the  read  pointer  will  have  been  properly  aligned,  a  read  is 
immediately  done  into  the  allocated  buffer.  Since  material  on 
disk  may  be  compressed,  the  domain  is  immediately  expanded  into  a 
second  buffer.  The  function  returns  the  index  of  the  record 
accessed  or  an  end  of  file  mark. 


ENE_DOMAIN_ACCESS :  This  routine  is  the  complement  of 
DCMAIN_ACCESS_INIT  and  MD1TI_DCMAIN_ACCESS„INIT.  It  releases 
buffer  space  that  was  allocated  by  the  domain  accessing  routines. 
We  have  adopted  the  convention  that  the  first  domain's 
information  table  entry  (for  a  multi-domain  access)  and  the 
domain  information  table  entry  (for  a  single-domain  access) 
contain  in  the  COMEINEI)_BYTE_WIDTH  field  the  total  buffer  length. 
The  buffer  size  can  therefore  be  obtained  by  a  table  look-up  and 
then  released. 


EY_EE_ACCES£_DCMAIN:  This  function  is  the 

except  that  it  extracts  domains  from 
prescribed  in  a  particular  EE.  This  of 


same  as  ACCESS_DOMAIN , 
only  those  records 
course  means  using  the 
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interface  to  the  E-tree  level.  The  returned  value  is  the  same  as 
in  access_eom;^.in. 

MOLTI_DOMAIN_ACCES£_INIT:  This  function  is  analogous  to 
ECMAIN_ACCESS_INIT,  only  for  multiple  domain  access.  It  is 
essentially  the  same  as  DOMAIN_ACCESS_INIT,  except  for  the  buffer 
allocation  for  the  expanded  domain  values.  To  avoid  addressing 
errors,  each  domain  must  be  word  aligned  in  the  buffer.  Hence, 
when  constructing  the  buffer  and  assigning  offsets,  the  procedure 
must  pad  out  fields  where  necessary  to  align  them  on  word 
boundaries . 

CCLLECT_DOMAINS ;  This  function  is  the  multiple  domain  counterpart 
of  ACCESS_DOMAIN.  It  is  somewhat  more  complex  than 
ACCESS__DOMAIN,  since  the  read  pointer  must  he  successively 
readjusted  to  the  first  byte  of  each  domain  to  be  assigned.  The 
routine  STOEE_DOMAINS  uses  the  preassigned  domain  offsets  to 
actually  place  the  expanded  domain  values  into  the  previously 
allocated  buffer. 

E^_BE_COLIECT_DOMAINS ; 

This  extends  COLLECT_DOMAINS  to  access  records  through  pointers 
stored  in  a  EE  (cf.  EY^EE.ACCESS^EOMAINS) . 

FIND_IXTEEMOM:  This  function  is  used  by  the  interpreter  to  find  a 
domain  extremum,  with  an  argument  tag  to  indicate  whether  a 
maximum  or  minimum  is  desired.  The  extremum  may  be  relative  to 
the  entire  record  type,  or  only  to  a  subset  of  it  specified  by  a 
EE.  In  the  former  case,  a  table  look-up  to  the  extremum  values 
table  on  SYSTEH^INFO  is  sufficient  to  obtain  the  answer.  If  a  BE 
is  involved,  then  the  records  specified  by  the  BE  are  accessed 
(using  the  data  structures  level)  to  determine  the  extremum.  In 
either  case,  the  result  is  returned  in  a  newly  allocated  buffer. 

INVEET_DCMAIN :  This  routine  interfaces  with  the  E-tree  level  to 
create  an  inverted  file  on  a  specified  domain  of  a  record  type. 
First,  the  E-tree  level  is  alerted  that  an  inversion  is  about  to 
be  formed.  The  record  type  is  initialized  for  reading  (using 
DCMAIN_ACCESS_INIT)  and  domain  values  are  successively  read 
(using  ACCESS_DCMAIN) .  For  each  value,  the  value  and  record 
number  are  inserted  (by  the  data  structures  level)  into  the  E- 
tree  that  holds  the  inverted  file. 


MELTI_VALDE_FIND :  This  function  is  the  multiple  domain  conterpart 
of  FIND_EXTEEilOM,  and  works  in  essentially  the  same  way.  A 
pointer  to  the  answer  buffer  is  returned. 


4.3.3  Eisk  Accessing  Commands 

The  following  disk  accessing  routines  are  placed 
together  in  a  separate  file  according  to  the  monitor  concept  of 
Hcare  and  Brinch  Hansen,  so  that  the  routines  accessing  a  data 
structure  are  not  scattered  throughout  the  program.  This 
technique  proved  to  be  a  very  valuable  tool  in  preliminary 
debugging . 
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GEI_FILE__IIlFO_FEOM_DISK :  This  function  brings  record  type 

inforiraticn  into  a  table  in  main  memory. 

GET_DOB_INFO_FFOM_EISK:  This  function  reads  a  block  of 

information  about  the  given  domain  from  SYSTEH__INFO. 


RE;iD_Mt]LTI_EXTBEIlDM_FECM_BISK:  The  extremum  value  for  each  of 
the  given  domains  is  read  from  the  extremum  value  table  on 
SYSTEM_INFO.  Each  extremum  value  is  expanded  and  placed  in  an 
answer  buffer,  which  is  then  returned. 

EETURN_INVERTED_FILE_INDEX:  If  the  given  domain  of  the  given 
record  type  is  inserted,  then  the  index  of  that  domain's  inverted 
file  is  returned.  If  there  is  no  entry  for  this  domain  in  the 
inverted  file  table,  then  a  special  null  value  is  returned. 

SET_FIIENJ^.ME:  This  function  converts  a  given  record  type  index 
into  an  internally  used  UNIX  filename.  If  the  request  originated 
from  the  interpreter  level,  then  the  resultant  name  is  written 
onto  SYSTEM^INFC.  If  the  data  structures  level  made  the  request, 
then  the  name  is  returned  directly. 

4.3.4  Utility  Functions 

The  following  functions  provide  internal  service 
routines  for  Mini-omega. 

CCMPAEE:  This  routine  accepts  as  arguments  pointers  to  two  memory 
areas  and  the  lengths  of  the  areas.  It  returns  a  value 
indicating  a  greater  than,  less  than,  or  equal  to  relationship  of 
the  first  area  to  the  second. 

CCMPOTEE_WIDTH :  This  function  sets  and  returns  a  given  domain's 
byte  width  as  indicated  by  its  type.  There  are  three  numeric 
types  of  fixed  length,  and  a  character  type  of  variable  length. 
The  values  0,  -1,  and  -2  are  used  to  indicate  the  numeric  types, 
while  positive  values  indicate  the  length  of  a  character  domain. 

COPY:  This  routine  copies  the  contents  of  one  buffer  into  another 
of  a  specified  length. 

DDMP__DCMi^.IN_TAELES :  This  is  a  debugging  aid,  which  simply 
displays  the  contents  of  the  current  domain  table  entries  for  the 
currently  open  file. 

DUMP_FI1E_TAELE:  This  too  is  a  debugging  routine.  It  displays 
the  file  information  table  entry  for  the  currently  open  file. 

EXPANE:  This  routine  is  used  to  word  align  the  contents  of  a 
given  input  buffer  and  transfer  it,  in  aligned  form,  to  an  output 
buffer.  The  packed  and  unpacked  length  of  the  buffer  value  are 
the  only  other  arguments  needed. 

FINB_FREE_SLOT_IN_FILE_EIEECTOEY:  This  function  finds  an  open 
slot  in  the  file  directory,  if  one  exists.  If  so,  the  file  number 
given  as  argument  is  assigned  to  that  slot  and  the  index  of  the 
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slct  is  returned.  If  the  directory  is  full,  a  "FOLL”  indication 
is  returned. 

INEEX_ IN_FILE_DIEECTOBY :  This  function  does  a  look-up  in  the 
file  directory  to  see  if  a  particular  file  is  present.  If  so,  the 
appropriafe  index  is  returned.  If  not,  a  null  value  is  returned. 

SEIOP_FILE_ACCESS ;  Since  many  files  may  need  to  be  concurrently 
open  for  handling  queries  across  several  record  types,  this 
routine  is  used  to  set  the  current  file  and  domain  information 
table  pointers  to  correspond  to  that  information  for  the  given 
record  type. 

STOBE__EOMAINS :  This  routine  expands  the  values  of  a  given  subset 
of  the  domains  in  a  given  input  buffer  into  an  output  buffer. 

4.4  STEDCTDEE  SYSTEM 

The  data  structure  system  uses  the  E-tree  table  and  the 
system  EE  table  as  previously  described  in  table  2.  Storage 
administration  of  the  tables  and  of  the  UNIX  files  that  contain 
E-tree  pages  is  done  in  a  uniform  manner.  Holes  (i.e.,  entries 
or  pages  no  longer  required)  are  chained  into  a  list  of  available 
space  ty  pointers;  there  is  an  available  space  area  at  the  bottom 
of  a  table  or  the  higher  part  of  a  file. 

The  main  memory  tables  are  managed  in  the  same  way  as 
Mini-omega.  Some  space  is  allocated  in  fixed  size  regions.  When 
mere  space  is  required,  it  is  taken  from  the  available  space  area 
and  chained  to  the  fixed  table  part. 

In  addition  to  the  information  in  the  E-tree  table,  each 
E-tree  page  contains  a  header  that  includes  the  type  of  the  node, 
the  number  of  keys  in  the  node,  the  number  of  bytes  of 
information  in  the  node  (not  including  the  header)  and  the  file 
addresses  of  the  left  brother  node,  the  right  brother  node,  and 
the  parent  node. 

There  are  two  types  of  nodes  to  consider:  leaves  and 
non-leaves.  Non-leaf  nodes  are  the  same  for  all  types  of  B- 
trees.  Basically,  non-leaf  pages  contain  alternating  values  of 
file  addresses  and  keys.  There  are  two  cases  to  consider  here. 
The  file  address  may  be  the  file  address  of  a  leaf  node.  Then  the 
next  adjacent  key  in  the  non- leaf  page  will  be  the  last  key  of 
the  leaf  node.  Alternatively,  the  file  address  may  be  the  file 
address  of  a  non-leaf  node.  In  this  case  the  next  adjacent  key  in 
the  non-leaf  page  is  the  largest  key  in  the  subtree  rooted  by  the 
node  that  is  stored  at  the  file  address.  The  first  and  last 
entries  in  a  non-leaf  node  are  always  file  addresses.  (At  this 
point  in  the  implementation,  the  right  brother  node  and  left 
brother  node  file  addresses  have  not  been  needed  in  non-leaf 
nodes.  They  will  probably  be  removed.) 

Leaf  nodes  are  structured  differently  for  EEs  and  the 
different  types  of  inverted  files.  In  the  leaf  nodes  of  a  BE 
only  keys  are  stored.  In  the  leaf  nodes  of  an  inverted  file,  the 
key  is  stored  followed  by  some  number  of  record  numbers.  The 
exact  structure  is  different  for  each  type  of  inverted  file. 
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In  the  following,  we  will  mainly  describe  the  procedures 
that  support  the  creation  of  B-trees  and  the  sequential 
processing  of  E-trees,  since  they  are  the  most  interesting  parts. 

4.4.1  E-tree  Creation 

This  module  consists  of  one  main  routine  CEEATE_B_TREE 
which  sets  up  a  new  E-tree.  It  calls  several  other  routines 
which  perform  specific  functions. 

CEEATE_E_TEEE  (TYPE, KEY^IENGTH, FILLING_FACTOE, INFO_FIELD_IENGTH, 
RT_NUHEEE ,CVEEFIOH_FACTOE,UNDEEFLCH_FACTOB) :  This  rout ine“cre ates 
a  new  E-tree  and  returns  its  name.  A  B-tree  is  created  by  adding 
an  entry  containing  the  new  E-tree* s  characteristics  to  the 
appropriate  table  (the  system  EE  table  for  system  EEs  or  the  B- 
tree  table  for  permanent  EEs  and  inverted  files) .  The  index  in 
the  table  is  translated  into  a  unique  name  and  returned.  The  new 
E-tree  consists  only  of  a  header  in  a  table.  For  system  BEs  a 
root  page  is  allocated  but  nothing  is  written  on  it.  For 
permanent  EEs  and  inverted  files  no  rcot  page  is  allocated. 

The  parameters  correspond  to  entries  in  the  E-tree 
header,  described  in  table  2.  Since  different  information  is 
stored  in  the  headers  of  different  types  of  B-trees,  not  all  the 
parameters  need  to  be  included  in  creating  all  types  of  E-trees. 

For  system  EEs  only  the  TYPE  and  KEY_LENGTH  parameters 
need  to  be  specified.  The  routine  SET_SYS_EE_TAELE__E KTEY  is 
called,  which  creates  a  system  EE.  The  name  of  the  EE,  which  is 
its  index  in  the  system  BE  table,  is  returned.  The  key  length, 
page  address  of  the  root  node,  and  number  of  keys  of  the  E-tree 
are  stored  in  the  EE  table. 

Permanent  BEs  and  inverted  files  are  stored  in  the  same 
table,  but  not  all  of  the  information  needed  for  inverted  files 
is  needed  for  permanent  BEs.  For  permanent  EEs  the  TYPE, 
KEY_LENGTB,  and  FILLING_FACTOE  must  be  specified.  For  inverted 
files  the  TYPE,  KEY_LENGTH,  FIILING_FACTOE,  INFC_FIE1D_LENGTH  and 
RT_NUMEEE  must  be  specified.  For  both  BEs  and  inverted  files,  the 
number  of  keys  is  set  to  zero  and  the  page  address  of  the  root 
node  is  set  to  ECF,  since  no  root  node  is  allocated  yet, 
OVEEFLCW_FACTOE  and  DNCERF10W_FACT0E  are  optional  for  both 
permanent  EEs  and  inverted  files.  If  they  are  included  in  the 
creation  of  a  permanent  BE,  zeroes  must  be  included  as  values  for 
IHFO_EIELE_LENGTH  and  RT^NUMEEE.  At  this  time  the  update 
routines  have  not  been  implemented  and  no  decisions  have  been 
made  concerning  optimal  overflow  and  underflow  factors. 
Therefore,  the  parameters  are  optional  to  allow  the  greatest 
flexibility.  The  default  values  have  been  set  arbitrarily  for 
now . 


If  the  TYPE  is  "permanent  EE",  then  a  name  is  assigned 
and  the  B-tree  is  stored  on  the  BE^FILE,  file,  which  holds  the 
bodies  of  all  permanent  EEs.  If  the  CVEEFLOW_FACTOE  and 
ONCER F10W_FACTCE  parameters  are  included,  the  values  are 
assigned.  Otherwise,  default  values  are  used. 
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If  the  TYPE  is  ore  of  the  inverted  file  types  the 
INFO_EIELD_LENGTH  and  ET__NOMBEE  are  assigned  to  the  B-tree  table 
entry.  The  routine  SET_IF_FILE  (BT_NUMBEE,HEADER_F)  assigns  to 
the  entry  the  file  name  for  inverted  files  of  record  type 
RT_NUMEEE.  The  OVEEFLOW_FACTOE  and  ONDERFLOW_FACTOE  are  assigned 
as  with  permanent  EEs. 

4.4.2  Eestruction  of  E-trees 

This  module  consists  of  one  main  routine, 
DESTEOY_E_TREE  (NAME)  which  destroys  the  B-tree  NAME,  and  several 
routines  called  by  DESTEO Y_E_TEEE. 

DESTEOY_E_TEEE  (NAME) :  This  routine  destroys  a  E-tree  by  freeing 
all  the  file  pages  used  to  store  the  B-tree  and  removing  the 
table  entry  for  the  B-tree.  If  the  E-tree  is  a  system  EE,  then  it 
is  destroyed  by  calling  CLEAE_Sys_EE (NAME) ,  which  destroys  system 
EEs.  If  the  E-tree  is  a  permanent  BE  or  inverted  file,  then  the 
header  for  the  E-tree  is  freed  from  the  B-tree  table.  The  file 
pages  used  to  store  the  E-tree  are  freed  by  calling 
EELEASE_SONS_CF  which  frees  all  the  pages  which  hold  nodes  that 
are  descendants  of  the  root  node  of  the  E-tree.  The  page  holding 
the  root  node  is  freed  directly  by  calling  FEEE_EE_FI1E_PAGE. 

CLEAE_SYS_EE  (NAME) :  This  routine  frees  all  the  file  pages  used  to 
store  the  E-tree  representing  the  system  EE,  NAME,  and  deletes 
the  entry  in  the  system  EE  table.  The  file  pages  used  to  store 
the  E-tree  are  freed  by  calling  EELEASE_SCNS_CF ,  which  frees  all 
descendants  cf  the  root  of  the  E-tree. 

EELEASE_SCNS_OF  (FILE_NUMBER,FILE_PAGE,KEY_LENGTH)  :  This  recursive 
procedure  frees  all  the  descendants  of  a  E-tree  node  stored  at 
FILE_PAGE.  If  the  node  stored  at  FILE_PAGE  is  a  leaf  node, 
nothing  is  done,  since  it  has  no  descendants  to  free.  If  the  node 
is  a  non-leaf  node,  it  is  necessary  to  free  the  descendants  on 
each  branch.  In  this  case,  the  procedure  iterates  through  all 
entries  in  the  page,  (recursively)  releasing  each  one. 

FBEE_EE_FILE_PAGE  (PAGE_ADER,FILE_NUMBEE)  :  This  routine  takes  the 
E-tree  page  PAGE_AEDR  or  FILE_NDMEEE  and  releases  it  to  the  list 
of  available  space  for  EEs. 


4.4.3  Sequential  Insertion  Routines 

This  module  is  made  up  of  three  main  routines  that 
insert  a  series  of  sorted  keys  into  a  BE.  The  EE  must  be 
initially  empty  and  the  series  of  keys  must  be  inserted  in  sorted 
order.  The  main  routines  in  the  module  are:  EE_INSEET_SEQ_INIT 
initiali2es  the  process;  EE_KEY_INSEET  is  called  for  each  key  to 
be  inserted;  END_E_I_S ()  ends  the  process. 

BE_INSEET_SEQ_INIT  (NAME) :  This  routine  initializes  the  sequential 
inserticn  process  of  the  E-tree  NAME,  which  must  be  an  empty 
basic  element.  Information  needed  for  the  process  is  obtained  by 
calling  GET_SEQ_INSERT_INFO .  Two  global  page  buffers,  PAGE1  and 
PAGE2,  are  allocated,  and  the  write  pointer  is  initialized  to 
pcint  to  the  first  word  of  PAGE1. 
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GET_SEQ_INSEET„INTO(NABE) ;  This  routine  initializes  basic 
inforiaticn  to  control  the  filling  process,  including;  the  key 
count  (initially  zero) ;  total  length  (initially  zero) ;  filling 
factor  (100%  for  system  BEs,  or  E-tree  table  value  for  permanent 
EEs) ;  key  length  (from  E-tree  table) ;  and  root  pointer  (from  B- 
tree  table)  . 

EE_KE Y_INSEET  (KE Y) ;  This  routine  adds  the  key  at  address  KEY  to 
PJ^GEI  and  if  Pi^.GEI  is  full  the  page  is  added  to  the  E-tree.  The 
write  pointer,  key  count,  and  total  length  are  incremented.  If 
Pi^GEI  is  full  (i.e.,  it  has  reached  its  filling  factor),  then  the 
page  is  added  to  the  E-tree  by  calling  SET_PAGE_AND_CHECK  and 
Pi^GEI  is  emptied  ty  reinitializing  the  write  pointer. 

SET_PAGE_ANE_CHECK 0 :  This  procedure  prepares  PAGE1  to  be 
inserted  into  the  E-tree.  The  header  information  including  the 
node  type  (=  ‘leaf*),  key  count,  and  byte  count  is  stored  in 
Pi^GEI.  A  EE  file  page  is  allocated  and  added  to  the  tree  by 
calling  ADE_PAGE_TC_TEEE. 

ArE_PAGE_TC_TEEE  (Pi^GE)  ;  This  routine  copies  PAGE1  into  the  file 
page  PAGE  and  adds  it  to  the  B-tree.  The  global  variable  SON 
irdicates  the  last  leaf  page  added  to  the  tree. 

If  the  tree  is  empty  (SON=EOF) ,  then  PAGE1  is  copied 
into  PAGE  and  SCN  is  set  to  PAGE. 

If  the  tree  is  not  empty,  SON  is  read  into  PAGE2.  The 
right  pointer  of  SCN  is  set  to  PAGE  and  the  left  pointer  of  PAGE 
is  set  to  SON.  The  last  key  in  SON  is  copied  into  the  buffer  KEY. 
The  variable  FATHEE  is  set  to  the  father  of  SON. 

If  FATHEE  is  EOF,  then  SCN  is  the  root  of  the  tree.  In 
this  case,  since  SCN  is  both  root  and  leaf,  then  it  must  be  the 
only  page  in  the  E-tree.  To  add  another  page,  SON  is  replaced  as 
root  by  calling  EOCT^TEEE.  PAGE2  is  written  back  and  its  FATHEE 
is  moved  into  FAGE2.  FATHEE  is  set  to  the  new  root  which  is 
returned  by  the  routine. 

If  FATHEE  is  not  EOF,  it  is  necessary  to  find  the  father 
of  PAGE.  SCN  is  written  back  from  PAGE2  (with  the  new  right 
pointer) ,  and  FATHEE  is  read  into  PAGE2.  Then  it  is  necessary  to 
climb  through  the  tree  and  find  a  page  that  can  take  another  son 
and  extend  a  branch  down  to  the  leaf  level.  This  may  involve 
adding  a  new  root  if  the  E-tree  is  full  at  all  levels. 

If  PAGE2  is  not  full,  then  PAGE  is  added  as  sone  of 
PAGE2  by  calling  AED_SON. 

If  PAGE2  is  full,  a  counter  I  is  incremented,  and  if  the 
father  of  the  page  in  PAGE2  exists,  it  is  read  into  PAGE2.  If  it 
does  not  exist,  FATHEE  is  replaces  as  root  by  calling  BOCT_TEEE. 
(FATHEE  is  the  page  in  PAGE2  at  all  times.)  At  the  end  of  this 
climbing  process  I  is  the  number  of  levels  the  branch  from  FATHEE 
must  be  extended  before  a  leaf  can  be  added.  The  branch  is 
extended  by  adding  a  new  son  from  FATHEE,  resetting  FATHEE  to  be 
the  new  son  and  decrementing  I,  continuing  until  I  is  zero.  A  new 
son  is  added  by  obtaining  a  EE  page  and  calling  ADD _ SON. 
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ECCT_TBEE  (P^^.GZ ,FILE_NUMEEB)  :  This  routine  takes  Pi^.GE,  which  is 
currently  the  root  of  the  E-tree,  and  makes  it  the  son  of  the  new 
rcct  page,  A  free  BE  page  is  obtained  to  be  the  new  root.  The 
father  pointer  of  PAGE  is  set  to  the  address  of  the  new  root. 
PAGE2  is  written  tc  PAGE.  The  page  header  information  for  the  new 
rcct  is  filled  into  PAGE2,  and  the  address  of  the  new  root  is 
returned . 

ADE_SCN  (NEW_SON,FATHEE,KEY,FILE_NDMBEB)  :  This  routine  adds 
NEK^SON  as  a  son  of  FATHEF.  The  address  of  NEW_SON  and  its 
largest  key  value  are  added  to  PAGE2  (which  holds  FATHER)  and  the 
write  pointer  is  incremented.  The  key  count  and  length  of  PAGE2 
are  incremented  in  PAGE2's  header.  PAGE2  is  rewritten  back  to 
FATHER. 

EKE_E_I_S  0 :  This  routine  adds  the  last  page  to  the  tree  if  it  is 
net  empty  and  deallocates  the  work  space.  In  addition,  the  root 
page  and  key  count  fields  of  the  B-tree  table  entry  or  system  BE 
table  entry  (whichever  is  appropriate)  are  updated. 


5.  SUMMARY 

In  this  report  we  have  presented  the  detailed  design  of 
a  the  relational  data  base  system  Omega.  The  main  features  of 
the  design  include  the  clear  separation  of  file  optimization  from 
access  path  optimization,  the  use  of  access  path  structures  that 
are  independent  of  the  file  access  method,  and  the  use  of  a 
system-oriented  guery  language  at  the  system  interface.  The 
system  is  highly  modular.  It  is  a  demonstration  that  principles 
of  structured  system  design  can  be  applied  to  small  data  base 
management  systems. 
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